[00:00:04] twentyafterfour: Dear deployers, time to do the Phabricator update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200820T0000).
[00:10:30] PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[00:19:40] dbtree works for me. expecting recovery.
[00:36:06] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 21682008 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:38:06] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 412472 and 109 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:10] PROBLEM - Check systemd state on ms-be1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:22] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1039 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[00:51:08] !log ms-be1039 - started failed ferm service
[00:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:02] RECOVERY - Check systemd state on ms-be1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:50] off
[01:19:18] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1039 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[01:21:22] (Abandoned) DannyS712: More cleanup of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/582879 (https://phabricator.wikimedia.org/T231178) (owner: DannyS712)
[01:58:50] mutante: loading dbtree takes about 15 secs here, maybe there's some slow backup?
[02:21:30] 13.37s here
[02:43:24] Operations, LDAP-Access-Requests: LDAP access to wmf group for CLo (WMF) - https://phabricator.wikimedia.org/T260866 (cwylo)
[02:47:22] Operations, LDAP-Access-Requests: LDAP access to wmf group for CLo (WMF) - https://phabricator.wikimedia.org/T260866 (Peachey88)
[02:48:26] Operations, LDAP-Access-Requests: LDAP access to wmf group for CLo (WMF) - https://phabricator.wikimedia.org/T260866 (Peachey88) @cwylo I've updated the request with the standard request template, Could you please update the purpose reason and if you already have shell access please.
[06:01:14] (CR) Giuseppe Lavagetto: [C: +2] services_proxy::envoy: completely overwrite X-Forwarded-Proto if needed [puppet] - https://gerrit.wikimedia.org/r/621283 (owner: Giuseppe Lavagetto)
[06:25:35] (PS1) Giuseppe Lavagetto: Revert "Revert "profile::service_proxy::envoy: inject XFP in all calls to mediawiki"" [puppet] - https://gerrit.wikimedia.org/r/621244
[06:27:44] (CR) Giuseppe Lavagetto: [C: +2] Revert "Revert "profile::service_proxy::envoy: inject XFP in all calls to mediawiki"" [puppet] - https://gerrit.wikimedia.org/r/621244 (owner: Giuseppe Lavagetto)
[06:40:37] (PS16) Ryan Kemper: elasticsearch: verify all write queues are empty [software/spicerack] - https://gerrit.wikimedia.org/r/619781
[06:41:20] (CR) Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1001/24592/ it seems to do the right thing." [puppet] - https://gerrit.wikimedia.org/r/621200 (owner: Giuseppe Lavagetto)
[06:42:55] (CR) jerkins-bot: [V: -1] elasticsearch: verify all write queues are empty [software/spicerack] - https://gerrit.wikimedia.org/r/619781 (owner: Ryan Kemper)
[06:43:20] (CR) Ryan Kemper: "Okay, assuming there's not any glaring errors, I think all that's left is the unit tests (getting the unit tests working should fix the tw" (5 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/619781 (owner: Ryan Kemper)
[06:43:32] (CR) Giuseppe Lavagetto: [C: +2] role::ores: enable the service proxy to the MediaWiki api [puppet] - https://gerrit.wikimedia.org/r/621200 (owner: Giuseppe Lavagetto)
[06:46:56] Operations, serviceops, Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (Ladsgroup) My 2c, I would really appreciate. having some sort of support for usecases like {T240884} while we are building something lik...
[06:55:46] (CR) JMeybohm: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/619858 (https://phabricator.wikimedia.org/T259684) (owner: Jeena Huneidi)
[07:02:21] I am checking dbtree timeouts
[07:06:57] Operations, serviceops, Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling) From what I understand, the service will run inside the sandbox, and eval() can be done in this service with relative safety....
[07:24:22] RECOVERY - WDQS high update lag on wdqs1007 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.157e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:25:30] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (dbprov1002), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:25:38] hello
[07:25:54] I am going to restart CI/Zuul for an upgrade
[07:28:58] !log hashar@deploy1001 Started deploy [zuul/deploy@5989ed0]: Upgrade gear from 0.7.0 to 1.15.1+wmf1 - T258630
[07:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:02] T258630: Improve scheduling of CI jobs invoked by zuul - https://phabricator.wikimedia.org/T258630
[07:29:11] !log hashar@deploy1001 Finished deploy [zuul/deploy@5989ed0]: Upgrade gear from 0.7.0 to 1.15.1+wmf1 - T258630 (duration: 00m 13s)
[07:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:55] !log contint1001: restarted zuul-merger
[07:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:58] (PS1) Filippo Giunchedi: admin: add cwylo to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/621470 (https://phabricator.wikimedia.org/T260866)
[07:35:29] Operations, LDAP-Access-Requests, Patch-For-Review: LDAP access to wmf group for CLo (WMF) - https://phabricator.wikimedia.org/T260866 (fgiunchedi) >>! In T260866#6398754, @Peachey88 wrote: > @cwylo I've updated the request with the standard request template, Could you please update the purpose reaso...
[07:37:02] looking for kind souls for an easy +1 on https://gerrit.wikimedia.org/r/621470
[07:39:02] !log contint2001: restarted zuul
[07:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:14] Operations, netops: Make eqord its own AS - https://phabricator.wikimedia.org/T259593 (ayounsi) To clarify export/import policies: From eqiad and codfw we export all WMF prefixes to eqord (and no DMZ), and apply the RPKI rules to the prefixes imported from eqord. From ulsfo we export only the POPs prefix...
[07:39:41] (CR) Kormat: [C: +1] admin: add cwylo to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/621470 (https://phabricator.wikimedia.org/T260866) (owner: Filippo Giunchedi)
[07:39:56] godog: seeing as you excluded me by the criteria, i've +1'd it out of spite :P
[07:40:04] (CR) Filippo Giunchedi: prometheus: use aggs to consolidate mediawiki logging metrics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/621098 (https://phabricator.wikimedia.org/T256418) (owner: Cwhite)
[07:40:19] (CR) Hashar: [C: +1] "Seems to match :]" [puppet] - https://gerrit.wikimedia.org/r/621470 (https://phabricator.wikimedia.org/T260866) (owner: Filippo Giunchedi)
[07:41:59] (CR) Filippo Giunchedi: "LGTM overall, see inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/621318 (https://phabricator.wikimedia.org/T256418) (owner: Cwhite)
[07:42:11] kormat: haha! well played
[07:42:32] kormat: I haven't excluded you from https://gerrit.wikimedia.org/r/c/operations/puppet/+/621302 though if you'd like to take a look
[07:43:49] (CR) Filippo Giunchedi: [C: +2] admin: add cwylo to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/621470 (https://phabricator.wikimedia.org/T260866) (owner: Filippo Giunchedi)
[07:43:58] (thanks all)
[07:44:38] ;]
[07:45:46] Operations, LDAP-Access-Requests, Patch-For-Review: LDAP access to wmf group for CLo (WMF) - https://phabricator.wikimedia.org/T260866 (fgiunchedi) Open→Resolved a:fgiunchedi @cwylo you are now in `wmf` LDAP group, resolving. Please reopen if something is amiss
[07:49:22] (CR) JMeybohm: [C: +1] Enable TLS for fluent-bit -> eventgate [deployment-charts] - https://gerrit.wikimedia.org/r/621332 (https://phabricator.wikimedia.org/T260626) (owner: Ppchelko)
[08:07:41] !log disable transit/peering on cr3-knams - T259621
[08:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:20] (PS3) Filippo Giunchedi: hieradata: disable panel html sanitization for grafana-labs [puppet] - https://gerrit.wikimedia.org/r/621197 (https://phabricator.wikimedia.org/T259143)
[08:08:22] (PS1) Filippo Giunchedi: grafana: remove deprecated settings [puppet] - https://gerrit.wikimedia.org/r/621472 (https://phabricator.wikimedia.org/T259143)
[08:21:01] !log reboot cr3-knams for upgrade - T259621
[08:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:06] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:25:35] expected ^
[08:25:38] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:25:40] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:26:00] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:26:00] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:26:02] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:27:38] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:27:55] that's a good sign
[08:27:56] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:27:56] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:28:00] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:29:00] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:29:30] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:31:06] all good
[08:31:57] !log enable transit/peering on cr3-knams - T259621
[08:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:00] (PS2) Ayounsi: Depool codfw for routers upgrade [dns] - https://gerrit.wikimedia.org/r/619969 (https://phabricator.wikimedia.org/T259621)
[08:40:54] (CR) Ayounsi: [C: +2] Depool codfw for routers upgrade [dns] - https://gerrit.wikimedia.org/r/619969 (https://phabricator.wikimedia.org/T259621) (owner: Ayounsi)
[08:41:29] !log depool codfw for routers upgrade - T259621
[08:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:06] (CR) Kormat: [C: +1] "AFAICT looks good :)" [puppet] - https://gerrit.wikimedia.org/r/621302 (owner: Filippo Giunchedi)
[08:44:52] !log running analyze table on db1115's tendril.global_status_log, may cause some stalls on tendril/dbtree T260876
[08:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:55] T260876: dbtree slowdown 2020-08-20 - https://phabricator.wikimedia.org/T260876
[08:48:57] !log bump cr2-codfw OSPF metrics - T259621
[08:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:30] RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 91424 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[08:52:33] !log disable transit/peering on cr2-codfw - T259621
[08:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:37] !log removing /usr/bin/check_mariadb.py from all db hosts T259516
[08:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:41] T259516: DBA python layout - https://phabricator.wikimedia.org/T259516
[08:53:51] Operations, serviceops, Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (Joe) Hi, sorry for not adding some comments earlier, I was busy with the aftermath of an UBN! task. Let me list some of the characteris...
[08:55:55] Operations: expired/missing gpg keys from reprepro external repos - https://phabricator.wikimedia.org/T260883 (fgiunchedi)
[08:58:34] PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[09:01:51] Operations, ops-codfw, DBA, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) Restored the data from backups, machine has caught up on replication with no further failures. No new errors in the hw logs. Going to repool it now.
[09:02:55] (PS1) Filippo Giunchedi: aptrepo: use gpg long fingerprints everywhere [puppet] - https://gerrit.wikimedia.org/r/621476 (https://phabricator.wikimedia.org/T260883)
[09:03:14] !log kormat@cumin1001 dbctl commit (dc=all): 'Repool db2125 after host failure T260670', diff saved to https://phabricator.wikimedia.org/P12303 and previous config saved to /var/cache/conftool/dbconfig/20200820-090313-kormat.json
[09:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:18] T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670
[09:06:10] Operations, Platform Engineering, Wikidata, serviceops, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (Joe) Open→Resolved Reporting here in brief: * We confirmed the problem had to do with activating firejail for all...
[09:06:53] also looking for volunteers for https://gerrit.wikimedia.org/r/c/operations/puppet/+/621476
[09:07:02] _joe_: any chance of public subtask
[09:07:05] Operations, Platform Engineering, Wikidata, serviceops, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (jijiki)
[09:07:55] <_joe_> RhinosF1: what info are you missing?
[09:08:47] _joe_: just like to be nosey if I'm honest and being with Miraheze we're always challenged on the minimum possible being private.
[09:08:55] !log reboot cr2-codfw:re1 (backup) for upgrade - T259621
[09:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:08] <_joe_> RhinosF1: I do agree with "minimum possible being private"
[09:09:27] <_joe_> and I'm unsure if I'd expose other installations to risks if I just made that task public tbh
[09:10:01] <_joe_> more or less all of the info that is relevant is in my last comment on the main task
[09:10:44] _joe_: I don't doubt it, don't worry if it can't be. I've asked the right people on our end to read your closing comment. My job is mainly just to help keep MediaWiki up to date and run maint scripts.
[09:10:50] (CR) Kormat: [C: +1] aptrepo: use gpg long fingerprints everywhere [puppet] - https://gerrit.wikimedia.org/r/621476 (https://phabricator.wikimedia.org/T260883) (owner: Filippo Giunchedi)
[09:10:52] <_joe_> RhinosF1: I'll discuss with others if there is a risk making it public though
[09:10:59] Thanks
[09:12:06] (CR) Gehel: elasticsearch: verify all write queues are empty (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/619781 (owner: Ryan Kemper)
[09:13:45] (CR) Filippo Giunchedi: [C: +2] aptrepo: use gpg long fingerprints everywhere [puppet] - https://gerrit.wikimedia.org/r/621476 (https://phabricator.wikimedia.org/T260883) (owner: Filippo Giunchedi)
[09:18:01] !log stress-testing db2125 T260670
[09:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:06] T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670
[09:18:09] !log cr2-codfw> request chassis routing-engine master switch - T259621
[09:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:15] that's the impactful one
[09:26:02] smoooooooth
[09:27:10] now re0
[09:33:12] !log reboot cr2-codfw:re0 (backup) for upgrade - T259621
[09:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:45] (CR) Gehel: [C: -1] "See comments inline." (5 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/619781 (owner: Ryan Kemper)
[09:41:30] !log cr2-codfw> request chassis routing-engine master switch - T259621
[09:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:52] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:46:18] expected
[09:46:21] Operations, conftool, serviceops: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (Joe)
[09:46:31] Operations, conftool, serviceops: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (Joe) p:Triage→High
[09:47:44] (CR) JMeybohm: "> Patch Set 5:" [deployment-charts] - https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[09:47:50] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:51:38] !log enable transit/peering and re-set normal OSPF values on cr2-codfw - T259621
[09:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:03] Operations, Traffic, conftool, serviceops: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (Joe) Adding traffic as their systems are the ones affected.
[09:57:29] !log bump cr1-codfw OSPF metrics - T259621
[09:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:01] (PS1) Lucas Werkmeister (WMDE): Don't try to load source maps in production [extensions/Wikibase] (wmf/1.36.0-wmf.5) - https://gerrit.wikimedia.org/r/621488 (https://phabricator.wikimedia.org/T260852)
[09:59:30] (CR) Lucas Werkmeister (WMDE): "Scheduled for EU backport window in one hour." [extensions/Wikibase] (wmf/1.36.0-wmf.5) - https://gerrit.wikimedia.org/r/621488 (https://phabricator.wikimedia.org/T260852) (owner: Lucas Werkmeister (WMDE))
[10:00:04] mvolz: My dear minions, it's time we take the moon! Just kidding. Time for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200820T1000).
[10:01:11] (CR) Michael Große: [C: +1] Don't try to load source maps in production [extensions/Wikibase] (wmf/1.36.0-wmf.5) - https://gerrit.wikimedia.org/r/621488 (https://phabricator.wikimedia.org/T260852) (owner: Lucas Werkmeister (WMDE))
[10:07:14] Warning Alert for device cr2-eqsin.wikimedia.org - Traffic on tunnel link
[10:07:30] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[10:12:55] !log reboot cr1-codfw:re1 (backup) for upgrade - T259621
[10:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:00] (PS1) Mvolz: Update Zotero [deployment-charts] - https://gerrit.wikimedia.org/r/621482 (https://phabricator.wikimedia.org/T255176)
[10:18:56] (PS2) Mvolz: Update Zotero to b0a30f98c [deployment-charts] - https://gerrit.wikimedia.org/r/621482 (https://phabricator.wikimedia.org/T255176)
[10:19:25] (CR) Mvolz: [C: +2] Update Zotero to b0a30f98c [deployment-charts] - https://gerrit.wikimedia.org/r/621482 (https://phabricator.wikimedia.org/T255176) (owner: Mvolz)
[10:19:56] (CR) Mvolz: [C: +2] Update Zotero to b0a30f98c [deployment-charts] - https://gerrit.wikimedia.org/r/621482 (https://phabricator.wikimedia.org/T255176) (owner: Mvolz)
[10:20:53] (Merged) jenkins-bot: Update Zotero to b0a30f98c [deployment-charts] - https://gerrit.wikimedia.org/r/621482 (https://phabricator.wikimedia.org/T255176) (owner: Mvolz)
[10:21:40] !log cr1-codfw> request chassis routing-engine master switch - T259621
[10:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:25] !log hashar@deploy1001 Started deploy [zuul/deploy@8a05b4d]: Support Gerrit replication events
[10:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:49] !log hashar@deploy1001 Finished deploy [zuul/deploy@8a05b4d]: Support Gerrit replication events (duration: 00m 24s)
[10:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:00] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:26:06] !log Restarted zuul-merger instances on contint1001 and contint2001
[10:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:18] (PS1) Ayounsi: Revert "Depool codfw for routers upgrade" [dns] - https://gerrit.wikimedia.org/r/621490
[10:27:58] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:34:46] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[10:35:22] Operations, ops-codfw: backup2001 RAID controller failure, unable to post 2020-08-19 - https://phabricator.wikimedia.org/T260764 (mark) @wiki_willy @Papaul It seems we've had an ongoing pattern of crashes with this (rather important) backup host, which means we are not yet able to trust it. Until we are...
[10:37:07] (PS1) Giuseppe Lavagetto: confd: disable -watch for machines connected to etcd 3.x [puppet] - https://gerrit.wikimedia.org/r/621484 (https://phabricator.wikimedia.org/T260889)
[10:37:50] I'm trying to deploy zotero (a service) but helmfile diff isn't showing the change; did I mess it up by +2ing before jenkins got a chance to vote? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/621482
[10:44:04] (CR) Jbond: [C: -1] "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) (owner: Bstorm)
[10:45:02] !log cr1-codfw> request chassis routing-engine master switch - T259621
[10:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:42] (CR) Jbond: [C: -1] "> Patch Set 1: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) (owner: Bstorm)
[10:47:03] (PS1) Filippo Giunchedi: aptrepo: import reprepro 'updates' public keys [puppet] - https://gerrit.wikimedia.org/r/621485 (https://phabricator.wikimedia.org/T260883)
[10:47:05] (PS1) Filippo Giunchedi: aptrepo: import current reprepro 'updates' keys [puppet] - https://gerrit.wikimedia.org/r/621506 (https://phabricator.wikimedia.org/T260883)
[10:48:56] (PS2) Matthias Mullie: Correct CirrusSearchUserTesting configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/621099 (https://phabricator.wikimedia.org/T254388) (owner: Ebernhardson)
[10:48:58] (CR) Vgutierrez: [C: +1] confd: disable -watch for machines connected to etcd 3.x [puppet] - https://gerrit.wikimedia.org/r/621484 (https://phabricator.wikimedia.org/T260889) (owner: Giuseppe Lavagetto)
[10:49:22] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:49:52] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:50:18] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[10:51:22] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:51:52] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:53:20] !log un-drain cr1-codfw - T259621
[10:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:10] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[10:57:37] (CR) Ayounsi: [C: +2] Revert "Depool codfw for routers upgrade" [dns] - https://gerrit.wikimedia.org/r/621490 (owner: Ayounsi)
[10:57:59] !log re-pool codfw - T259621
[10:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:54] (CR) Jbond: [C: +1] decom releases1001 [puppet] - https://gerrit.wikimedia.org/r/621090 (https://phabricator.wikimedia.org/T260742) (owner: Dzahn)
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200820T1100).
[11:00:04] matthiasmullie and Lucas_WMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:09] o/
[11:00:16] o/
[11:00:26] matthiasmullie: do you want to deploy your change yourself?
[11:00:29] \o/
[11:00:34] yeah sure
[11:00:37] ok!
[11:00:44] :-)
[11:00:52] Shall I start?
[11:01:06] sure, go ahead!
[11:01:09] let me know when you’re done :)
[11:02:06] (CR) Matthias Mullie: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/621099 (https://phabricator.wikimedia.org/T254388) (owner: Ebernhardson)
[11:02:13] ̶W̶a̶r̶n̶i̶n̶g Device cr2-eqsin.wikimedia.org recovered from Traffic on tunnel link
[11:02:55] (Merged) jenkins-bot: Correct CirrusSearchUserTesting configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/621099 (https://phabricator.wikimedia.org/T254388) (owner: Ebernhardson)
[11:03:03] matthiasmullie: nitpick: it's no longer called "SWAT", but "Backport window" ;)
[11:03:06] (PS2) Matthias Mullie: Fix testwikidata depicts property id [mediawiki-config] - https://gerrit.wikimedia.org/r/620694 (https://phabricator.wikimedia.org/T258048)
[11:03:26] (CR) Matthias Mullie: [C: +2] "Backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/620694 (https://phabricator.wikimedia.org/T258048) (owner: Matthias Mullie)
[11:03:31] Urbanecm: right
[11:03:40] I liked BACON, though
[11:04:23] (Merged) jenkins-bot: Fix testwikidata depicts property id [mediawiki-config] - https://gerrit.wikimedia.org/r/620694 (https://phabricator.wikimedia.org/T258048) (owner: Matthias Mullie)
[11:04:54] I’ll already +2 my backport, since it’ll take a while to go through CI anyways
[11:05:08] (CR) Lucas Werkmeister (WMDE): [C: +2] Don't try to load source maps in production [extensions/Wikibase] (wmf/1.36.0-wmf.5) - https://gerrit.wikimedia.org/r/621488 (https://phabricator.wikimedia.org/T260852) (owner: Lucas Werkmeister (WMDE))
[11:07:16] !log [urbanecm@mwmaint1002 ~]$ mwscript emptyUserGroup.php --wiki=trwiki editor # T260899
[11:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:19] T260899: Remove user from the defunct user group - https://phabricator.wikimedia.org/T260899
[11:07:26] !log mlitn@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Fix testwikidata depicts id & CirrusSearchUserTesting config (duration: 01m 06s)
[11:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:50] Lucas_WMDE: done, the floor is yours
[11:07:56] thanks!
[11:08:00] just waiting for CI then ^^
[11:11:54] is it expected that the logspam-watch contains so much /home/holger/test.php?
[11:13:14] probably not
[11:14:21] it's innocent logspam AFAICS though
[11:14:47] Lucas_WMDE: would you mind me backporting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/621254 too?
[11:15:13] sure, that’s a good idea
[11:15:39] (PS1) Urbanecm: Use $user param when filtering edits [extensions/AbuseFilter] (wmf/1.36.0-wmf.5) - https://gerrit.wikimedia.org/r/621491 (https://phabricator.wikimedia.org/T258717)
[11:16:13] (PS1) Urbanecm: Use $user param when filtering edits [extensions/AbuseFilter] (wmf/1.36.0-wmf.4) - https://gerrit.wikimedia.org/r/621492 (https://phabricator.wikimedia.org/T258717)
[11:16:17] (CR) Urbanecm: [C: +2] Use $user param when filtering edits [extensions/AbuseFilter] (wmf/1.36.0-wmf.5) - https://gerrit.wikimedia.org/r/621491 (https://phabricator.wikimedia.org/T258717) (owner: Urbanecm)
[11:16:48] (CR) Urbanecm: [C: +2] Use $user param when filtering edits [extensions/AbuseFilter] (wmf/1.36.0-wmf.4) - https://gerrit.wikimedia.org/r/621492 (https://phabricator.wikimedia.org/T258717) (owner: Urbanecm)
[11:19:31] thanks Lucas_WMDE :)
[11:27:14] (Merged) jenkins-bot: Don't try to load source maps in production [extensions/Wikibase] (wmf/1.36.0-wmf.5) - https://gerrit.wikimedia.org/r/621488 (https://phabricator.wikimedia.org/T260852) (owner: Lucas Werkmeister (WMDE))
[11:27:19] whee
[11:27:46] yay!
[11:28:34] change is on mwdebug1001, testing [11:28:53] yup, that seems to resolve the 404s alright [11:29:04] 👍 [11:29:16] ok, syncing [11:30:10] I think I can sync only php-1.36.0-wmf.5/extensions/Wikibase/client/data-bridge/dist/ [11:30:14] and leave out vue.config.js [11:30:33] you'll see ; [11:30:36] ) [11:30:40] yes, that should be fine [11:32:00] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.36.0-wmf.5/extensions/Wikibase/client/data-bridge/dist/: Backport: [[gerrit:621488|Don't try to load source maps in production (T260852)]] (duration: 01m 07s) [11:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:05] T260852: Source map error: Error: request failed with status 404 Resource URL - https://phabricator.wikimedia.org/T260852 [11:33:33] Urbanecm: the window is yours [11:33:40] thanks [11:33:46] waiting on CI [11:34:33] skeleton_in_front_of_monitor.jpg [11:35:01] (03Merged) 10jenkins-bot: Use $user param when filtering edits [extensions/AbuseFilter] (wmf/1.36.0-wmf.5) - 10https://gerrit.wikimedia.org/r/621491 (https://phabricator.wikimedia.org/T258717) (owner: 10Urbanecm) [11:35:07] wohoho [11:36:15] (03Merged) 10jenkins-bot: Use $user param when filtering edits [extensions/AbuseFilter] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/621492 (https://phabricator.wikimedia.org/T258717) (owner: 10Urbanecm) [11:36:18] since this involves a job, I have no way but to sync this and test at real servers [11:38:00] and I’m creating a Phab task for those WrapSections errors that popped up in logspam-watch [11:38:15] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.5/extensions/AbuseFilter/includes/AbuseFilterHooks.php: 00da39b6913ac2eab600bbb61258472b60d2cbcb: Use $user param when filtering edits (T258717) (duration: 01m 05s) [11:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:47] Daimona: seems to work! 
https://test.wikidata.org/w/index.php?title=Special:AbuseLog&wpSearchTitle=Q212703 [11:39:15] Nice! [11:39:47] It does indeed seem to be working fine [11:40:02] even more, it fixed more than one bug it seems [11:40:27] my test before-deployment (https://test.wikidata.org/w/index.php?title=Q212703&diff=530625&oldid=530624) is not tagged with the single purpose tag [11:40:33] but all subsequent edits are, as they should be [11:41:55] filed T260900 for WrapSections [11:41:56] T260900: Return value of Wikimedia\Parsoid\Wt2Html\PP\Processors\WrapSections::getDSR() must be of the type integer, null returned - https://phabricator.wikimedia.org/T260900 [11:44:17] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.4/extensions/AbuseFilter/includes/AbuseFilterHooks.php: d762e7b5526d91fe21e5980bc5e9f3be06a2f85c: Use $user param when filtering edits (T258717) (duration: 01m 05s) [11:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:21] T258717: AbuseFilter doesn't see Wikibase's impersonation of admins when removing pages from a WD item - https://phabricator.wikimedia.org/T258717 [11:45:09] * Urbanecm wishes there was a true way to delete an AF [11:46:09] Good news then, thank you :) [11:46:53] no, thank you - you did the patch, I just pushed it out :-) [11:49:06] !log EU backport window done [11:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:16] thanks Lucas_WMDE [11:49:27] thanks for backporting :) [11:52:56] 10Operations, 10SRE-tools, 10serviceops: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (10JMeybohm) The pool/depool logic is quite the same as in `sre.discovery.pool/depool` I guess. For checking the service availability on some o... [12:05:29] (03CR) 10Filippo Giunchedi: "There's the obvious disadvantage that this approach, while simple, doesn't remove keys. LMK what you think!" 
[puppet] - 10https://gerrit.wikimedia.org/r/621485 (https://phabricator.wikimedia.org/T260883) (owner: 10Filippo Giunchedi) [12:10:29] (03PS2) 10Filippo Giunchedi: grafana: remove deprecated settings [puppet] - 10https://gerrit.wikimedia.org/r/621472 (https://phabricator.wikimedia.org/T259143) [12:11:02] 10Operations, 10SRE-tools, 10serviceops: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (10Joe) >>! In T260663#6399630, @JMeybohm wrote: > The pool/depool logic is quite the same as in `sre.discovery.pool/depool` I guess. > > For c... [12:14:10] Zuul /CI gracefully restarting [12:14:59] (03PS2) 10JMeybohm: helmfile: refactor eventgate-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/621286 (https://phabricator.wikimedia.org/T258572) [12:20:19] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add observability stack hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/621302 (owner: 10Filippo Giunchedi) [12:21:00] 10Operations, 10netops: No Juniper alarms in SNMP for MX204 - https://phabricator.wikimedia.org/T241105 (10ayounsi) 05Open→03Stalled p:05Medium→03Low a:05ayounsi→03None [12:37:06] (03CR) 10Ottomata: "+1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/621286 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [12:37:36] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [12:41:12] 10Operations, 10ops-codfw: backup2001 RAID controller failure, unable to post 2020-08-19 - https://phabricator.wikimedia.org/T260764 (10Papaul) @mark I am planning on opening a case with Dell to see what they can find on their end. 
[12:42:41] 10Operations, 10Traffic: Enable DNSSEC validation in Wikidough - https://phabricator.wikimedia.org/T259816 (10jbond) > Given that outages due to misconfigured DNSSEC domains are all too common (see https://ianix.com/pub/dnssec-outages.html for a list) Im not sure i would agree that they are "all to common".... [12:42:45] <_joe_> jouncebot: next [12:42:45] In 3 hour(s) and 17 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200820T1600) [12:42:50] <_joe_> cool [12:44:24] !log oblivian@deploy1001 Started deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy T244843 [12:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:28] T244843: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 [12:46:47] (03CR) 10Jbond: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621485 (https://phabricator.wikimedia.org/T260883) (owner: 10Filippo Giunchedi) [12:50:05] (03PS1) 10Filippo Giunchedi: pontoon: use yaml extension for template [puppet] - 10https://gerrit.wikimedia.org/r/621517 [12:51:27] !log oblivian@deploy1001 Finished deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy T244843 (duration: 07m 03s) [12:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:31] T244843: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 [12:51:43] <_joe_> for the record, we just rolled back :) [12:52:50] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: use yaml extension for template [puppet] - 10https://gerrit.wikimedia.org/r/621517 (owner: 10Filippo Giunchedi) [12:53:07] (03CR) 10Jbond: "We should probably set up some monitoring to detect when keys are expiring. 
also unless this is critical suggest waiting for moritz to en" [puppet] - 10https://gerrit.wikimedia.org/r/621485 (https://phabricator.wikimedia.org/T260883) (owner: 10Filippo Giunchedi) [12:55:22] 10Puppet, 10DBA, 10cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (10jcrespo) Hey, cloud people! :-D Could I get an ack of this ticket from someone at cloud (or a redirection to the right server owners)? I don't particularly need this f... [12:57:06] 10Puppet, 10DBA, 10cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (10jcrespo) [13:00:19] !log oblivian@deploy1001 Started deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy T244843 [13:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:23] T244843: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 [13:01:20] (03PS1) 10Jcrespo: mariadb-backups: Ignore backup freshness check for dbprov1* hosts [puppet] - 10https://gerrit.wikimedia.org/r/621520 (https://phabricator.wikimedia.org/T260764) [13:01:35] (03PS2) 10Jcrespo: mariadb-backups: Ignore backup freshness check for dbprov1* hosts [puppet] - 10https://gerrit.wikimedia.org/r/621520 (https://phabricator.wikimedia.org/T260764) [13:04:35] (03CR) 10Jcrespo: "Short term "ack" of icinga alerts (better than acking all checks)." 
[puppet] - 10https://gerrit.wikimedia.org/r/621520 (https://phabricator.wikimedia.org/T260764) (owner: 10Jcrespo) [13:05:31] (03CR) 10Kormat: [C: 03+1] mariadb-backups: Ignore backup freshness check for dbprov1* hosts [puppet] - 10https://gerrit.wikimedia.org/r/621520 (https://phabricator.wikimedia.org/T260764) (owner: 10Jcrespo) [13:07:18] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/621485 (https://phabricator.wikimedia.org/T260883) (owner: 10Filippo Giunchedi) [13:07:58] (03CR) 10Filippo Giunchedi: aptrepo: import reprepro 'updates' public keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621485 (https://phabricator.wikimedia.org/T260883) (owner: 10Filippo Giunchedi) [13:08:07] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Ignore backup freshness check for dbprov1* hosts [puppet] - 10https://gerrit.wikimedia.org/r/621520 (https://phabricator.wikimedia.org/T260764) (owner: 10Jcrespo) [13:09:25] !log repool wdqs1007 - catched up on lag [13:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:08] (03PS2) 10Filippo Giunchedi: aptrepo: import reprepro 'updates' public keys [puppet] - 10https://gerrit.wikimedia.org/r/621485 (https://phabricator.wikimedia.org/T260883) [13:10:10] (03PS2) 10Filippo Giunchedi: aptrepo: import current reprepro 'updates' keys [puppet] - 10https://gerrit.wikimedia.org/r/621506 (https://phabricator.wikimedia.org/T260883) [13:10:24] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:11:18] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:11:38] !log oblivian@deploy1001 Finished deploy [ores/deploy@a208a0e]: switch testwiki to use envoy 
as a service proxy T244843 (duration: 11m 19s) [13:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:42] T244843: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 [13:14:41] !log oblivian@deploy1001 Started deploy [ores/deploy@74677b6]: switch testwiki to use envoy as a service proxy T244843 (take 2) [13:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:18] !log oblivian@deploy1001 Finished deploy [ores/deploy@74677b6]: switch testwiki to use envoy as a service proxy T244843 (take 2) (duration: 11m 37s) [13:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:22] T244843: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 [13:29:10] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:30:57] 10Operations, 10Traffic: Enable DNSSEC validation in Wikidough - https://phabricator.wikimedia.org/T259816 (10jbond) >> unless the client set the AD and/or DO bits > > do we know what chrome/FF set on queries? Really not familiar with FF/chrome code but this looks like a no FF: https://searchfox.org/mozil... 
[13:39:14] !log oblivian@deploy1001 Started deploy [ores/deploy@e860508]: switch everything to use envoy as a service proxy T244843 [13:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:17] T244843: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 [13:47:57] (03CR) 10Ppchelko: [C: 03+2] Enable TLS for fluent-bit -> eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/621332 (https://phabricator.wikimedia.org/T260626) (owner: 10Ppchelko) [13:48:06] PROBLEM - Check no envoy runtime configuration is left persistent on ores1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 393 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [13:49:09] (03Merged) 10jenkins-bot: Enable TLS for fluent-bit -> eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/621332 (https://phabricator.wikimedia.org/T260626) (owner: 10Ppchelko) [13:51:35] <_joe_> the ores1001 thing is me [13:53:14] !log oblivian@deploy1001 Finished deploy [ores/deploy@e860508]: switch everything to use envoy as a service proxy T244843 (duration: 14m 00s) [13:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:21] T244843: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 [13:53:56] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[13:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:38] !log oblivian@deploy1001 Started deploy [ores/deploy@8540eec]: various configuration fixes [13:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:49] 10Operations, 10ops-codfw, 10DBA, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Kormat) Completed successfully: ` kormat@db2125:~(0:0)$ time sudo mysqldump --single-transaction --all-databases > /dev/null real 249m58.312s user 132m25.215... [14:06:41] !log oblivian@deploy1001 Finished deploy [ores/deploy@8540eec]: various configuration fixes (duration: 09m 03s) [14:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:35] 10Operations, 10ops-codfw: backup2001 RAID controller failure, unable to post 2020-08-19 - https://phabricator.wikimedia.org/T260764 (10Papaul) can we please depool this server [14:17:03] (03PS1) 10Ppchelko: Add puppetca to fluent-bit for cert validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/621527 (https://phabricator.wikimedia.org/T260626) [14:26:25] 10Operations, 10Traffic: Enable DNSSEC validation in Wikidough - https://phabricator.wikimedia.org/T259816 (10ssingh) >>! In T259816#6399814, @jbond wrote: >> Given that outages due to misconfigured DNSSEC domains are all too common (see https://ianix.com/pub/dnssec-outages.html for a list) > Im not sure i wo... 
[14:27:59] (03CR) 10Ppchelko: [C: 03+2] Add puppetca to fluent-bit for cert validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/621527 (https://phabricator.wikimedia.org/T260626) (owner: 10Ppchelko) [14:29:13] (03Merged) 10jenkins-bot: Add puppetca to fluent-bit for cert validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/621527 (https://phabricator.wikimedia.org/T260626) (owner: 10Ppchelko) [14:32:01] 10Operations, 10ops-codfw: backup2001 RAID controller failure, unable to post 2020-08-19 - https://phabricator.wikimedia.org/T260764 (10jcrespo) Server is down and unusable. The only ask is if data on arrays could possibly kept in order not to lose previous backups. It is also downtime'd until monday. [14:34:16] (03PS1) 10Jcrespo: dbtree: Implement use_index parsing and apply it to QPS query [software/dbtree] - 10https://gerrit.wikimedia.org/r/621529 (https://phabricator.wikimedia.org/T260876) [14:35:59] 10Puppet, 10DBA, 10cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (10Andrew) Sorry about the slow response! Since you first opened this ticket I've moved all VMs off this puppetmaster; it's now set to role::spare::system. Anything you... [14:39:48] 10Puppet, 10DBA, 10cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (10jcrespo) Thank you very much @andrew! Indeed, backups jobs have been automatically removed, so no need for any further action, except revert from the alert ignore list.... [14:40:33] 10Operations, 10ops-codfw: backup2001 RAID controller failure, unable to post 2020-08-19 - https://phabricator.wikimedia.org/T260764 (10Papaul) According to Dell the PERC controller is not been detected correctly Step 1- Upgrade the controller drivers. if same problem go to step 2 Step 2- A replacement will... [14:41:29] (03CR) 10Jcrespo: [C: 04-2] "This doesn't work, it is a WIP." 
[software/dbtree] - 10https://gerrit.wikimedia.org/r/621529 (https://phabricator.wikimedia.org/T260876) (owner: 10Jcrespo) [14:43:14] (03PS1) 10Ppchelko: Properly indent puppetca.crt for fluent-bit [deployment-charts] - 10https://gerrit.wikimedia.org/r/621530 (https://phabricator.wikimedia.org/T260626) [14:43:41] (03PS2) 10Jcrespo: dbtree: Implement use_index parsing and apply it to QPS query [software/dbtree] - 10https://gerrit.wikimedia.org/r/621529 (https://phabricator.wikimedia.org/T260876) [14:45:40] (03PS2) 10Ppchelko: Properly indent puppetca.crt for fluent-bit [deployment-charts] - 10https://gerrit.wikimedia.org/r/621530 (https://phabricator.wikimedia.org/T260626) [14:45:56] PROBLEM - Check systemd state on ores1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:01] (03CR) 10Ppchelko: [C: 03+2] Properly indent puppetca.crt for fluent-bit [deployment-charts] - 10https://gerrit.wikimedia.org/r/621530 (https://phabricator.wikimedia.org/T260626) (owner: 10Ppchelko) [14:46:02] (03PS1) 10Ssingh: wikidough: enable DNSSEC validation in pdns-recursor [puppet] - 10https://gerrit.wikimedia.org/r/621531 (https://phabricator.wikimedia.org/T259816) [14:47:12] (03Merged) 10jenkins-bot: Properly indent puppetca.crt for fluent-bit [deployment-charts] - 10https://gerrit.wikimedia.org/r/621530 (https://phabricator.wikimedia.org/T260626) (owner: 10Ppchelko) [14:47:25] (03PS1) 10Vgutierrez: Update 0005-stats-shortlived.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621532 (https://phabricator.wikimedia.org/T260702) [14:47:26] (03PS1) 10Vgutierrez: Update 0006-transaction-timeout.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621533 (https://phabricator.wikimedia.org/T260702) [14:47:28] (03PS1) 10Vgutierrez: Refresh 0037-force-discard.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621534 
(https://phabricator.wikimedia.org/T260702) [14:47:34] 10Operations, 10Traffic, 10Patch-For-Review: Enable DNSSEC validation in Wikidough - https://phabricator.wikimedia.org/T259816 (10jbond) > So this means that they treat the DO bit to not only return the DNSSEC records but also to validate them? I can check this in the code but I just wanted to confirm if I a... [14:47:56] (03CR) 10jerkins-bot: [V: 04-1] Update 0005-stats-shortlived.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621532 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [14:47:58] (03CR) 10jerkins-bot: [V: 04-1] Update 0006-transaction-timeout.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621533 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [14:48:04] (03CR) 10jerkins-bot: [V: 04-1] Refresh 0037-force-discard.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621534 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [14:48:43] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/24595/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/621531 (https://phabricator.wikimedia.org/T259816) (owner: 10Ssingh) [14:49:36] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/621531 (https://phabricator.wikimedia.org/T259816) (owner: 10Ssingh) [14:52:32] RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 91054 bytes in 0.333 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [14:52:41] (03PS1) 10Ppchelko: Bump api-gateway chart version to 0.0.19 [deployment-charts] - 10https://gerrit.wikimedia.org/r/621536 [14:53:02] (03CR) 10Ppchelko: [C: 03+2] Bump api-gateway chart version to 0.0.19 [deployment-charts] - 10https://gerrit.wikimedia.org/r/621536 (owner: 10Ppchelko) [14:54:09] (03Merged) 10jenkins-bot: Bump api-gateway chart version to 0.0.19 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/621536 (owner: 10Ppchelko) [14:55:46] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [14:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:05] (03PS3) 10Jcrespo: dbtree: Implement use_index parsing and apply it to QPS query [software/dbtree] - 10https://gerrit.wikimedia.org/r/621529 (https://phabricator.wikimedia.org/T260876) [14:57:12] (03PS4) 10Jcrespo: dbtree: Implement use_index parsing and apply it to QPS query [software/dbtree] - 10https://gerrit.wikimedia.org/r/621529 (https://phabricator.wikimedia.org/T260876) [14:58:32] (03CR) 10Jcrespo: "This is not by best work, but it should work, and be as safe as the rest of the codebase." [software/dbtree] - 10https://gerrit.wikimedia.org/r/621529 (https://phabricator.wikimedia.org/T260876) (owner: 10Jcrespo) [14:59:48] PROBLEM - ores_workers_running on ores1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [15:01:43] (03PS1) 10Jcrespo: Revert labtestpuppetmaster2001 addition to backup check ignorelist [puppet] - 10https://gerrit.wikimedia.org/r/621537 (https://phabricator.wikimedia.org/T256846) [15:01:55] (03PS2) 10Jcrespo: Revert labtestpuppetmaster2001 addition to backup check ignorelist [puppet] - 10https://gerrit.wikimedia.org/r/621537 (https://phabricator.wikimedia.org/T256846) [15:03:23] (03PS1) 10Andrew Bogott: ceph backups: exclude integration agents [puppet] - 10https://gerrit.wikimedia.org/r/621538 (https://phabricator.wikimedia.org/T260692) [15:09:20] RECOVERY - Check systemd state on ores1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:09:37] (03PS4) 10Giuseppe Lavagetto: Switch all charts from "stable" to "wmf-stable" [deployment-charts] - 10https://gerrit.wikimedia.org/r/620935 (https://phabricator.wikimedia.org/T258572) 
[15:10:38] (03CR) 10Jcrespo: [C: 03+2] Revert labtestpuppetmaster2001 addition to backup check ignorelist [puppet] - 10https://gerrit.wikimedia.org/r/621537 (https://phabricator.wikimedia.org/T256846) (owner: 10Jcrespo) [15:11:32] RECOVERY - ores_workers_running on ores1002 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [15:13:23] (03CR) 10Hashar: [C: 03+1] "Backing up by default and then explicitly excluding agents sounds like a good thing to have." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621538 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [15:15:19] 10Puppet, 10DBA, 10Patch-For-Review, 10cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (10jcrespo) 05Open→03Resolved a:03jcrespo [15:15:45] (03PS3) 10Cwhite: prometheus: update prometheus-es-exporter config test to enforce namespace [puppet] - 10https://gerrit.wikimedia.org/r/621318 (https://phabricator.wikimedia.org/T256418) [15:16:31] (03CR) 10jerkins-bot: [V: 04-1] prometheus: update prometheus-es-exporter config test to enforce namespace [puppet] - 10https://gerrit.wikimedia.org/r/621318 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [15:16:50] (03CR) 10Cwhite: prometheus: update prometheus-es-exporter config test to enforce namespace (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621318 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [15:19:32] (03CR) 10RLazarus: [C: 03+1] confd: disable -watch for machines connected to etcd 3.x [puppet] - 10https://gerrit.wikimedia.org/r/621484 (https://phabricator.wikimedia.org/T260889) (owner: 10Giuseppe Lavagetto) [15:20:08] (03CR) 10Ssingh: [C: 03+2] wikidough: enable DNSSEC validation in pdns-recursor [puppet] - 10https://gerrit.wikimedia.org/r/621531 (https://phabricator.wikimedia.org/T259816) (owner: 10Ssingh) [15:20:11] (03CR) 10Andrew Bogott: [C: 03+2] ceph backups: 
exclude integration agents [puppet] - 10https://gerrit.wikimedia.org/r/621538 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [15:22:12] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition={0,1,2,3} prometheus=ops site=eqiad topic={rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&o [15:22:12] urce=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [15:28:30] (03PS1) 10Giuseppe Lavagetto: ores: create a safe-restart script for envoyproxy as well [puppet] - 10https://gerrit.wikimedia.org/r/621539 [15:28:46] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [15:30:38] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [15:33:10] 10Operations, 10ops-codfw: backup2001 RAID controller failure, unable to post 2020-08-19 - https://phabricator.wikimedia.org/T260764 (10Papaul) There is 1 bad disk on backup2001-array2002 slot 8 [15:33:18] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:36:38] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:37:16] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:37:38] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:38:43] 10Operations, 10Traffic, 10Patch-For-Review: Enable DNSSEC validation in Wikidough - https://phabricator.wikimedia.org/T259816 (10ssingh) >>! In T259816#6400068, @jbond wrote: >> So this means that they treat the DO bit to not only return the DNSSEC records but also to validate them? I can check this in the... 
[15:39:36] (03PS1) 10VulpesVulpes825: Remove the wrong workmark and tagline for Chinese Wikimedia Project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621542 (https://phabricator.wikimedia.org/T260908) [15:39:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] ores: create a safe-restart script for envoyproxy as well [puppet] - 10https://gerrit.wikimedia.org/r/621539 (owner: 10Giuseppe Lavagetto) [15:43:44] 10Operations, 10SRE-tools, 10serviceops: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (10JMeybohm) a:03JMeybohm Claiming as I would like us to have this for helmfile migration. [15:51:52] RECOVERY - Check no envoy runtime configuration is left persistent on ores1001 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:58:18] (03PS1) 10Ppchelko: Disable TLS in staging, enable TLS in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/621549 (https://phabricator.wikimedia.org/T260626) [15:59:14] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:59:42] ^ me [15:59:45] will fix [15:59:56] (03CR) 10Ppchelko: [C: 03+2] Disable TLS in staging, enable TLS in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/621549 (https://phabricator.wikimedia.org/T260626) (owner: 10Ppchelko) [16:00:04] godog and _joe_: Your horoscope predicts another unfortunate Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200820T1600). 
[16:01:04] (03Merged) 10jenkins-bot: Disable TLS in staging, enable TLS in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/621549 (https://phabricator.wikimedia.org/T260626) (owner: 10Ppchelko) [16:01:34] PROBLEM - Host dbprov2003 is DOWN: PING CRITICAL - Packet loss = 100% [16:01:58] PROBLEM - puppet last run on mwdebug1002 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:02:36] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [16:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:10] RECOVERY - puppet last run on mwdebug1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:08:02] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [16:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:16] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [16:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:38] !log restart elasticsearch on logstash1011 -- long gc runs [16:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:14] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[16:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:50] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:19:32] RECOVERY - Check systemd state on backup2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:38] RECOVERY - SSH on backup2001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:20:16] RECOVERY - bacula sd process on backup2001 is OK: PROCS OK: 1 process with UID = 112 (bacula), command name bacula-sd https://wikitech.wikimedia.org/wiki/Bacula [16:20:24] RECOVERY - Check size of conntrack table on backup2001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [16:22:37] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install kubernetes2017.codfw.wmnet - https://phabricator.wikimedia.org/T258745 (10Papaul) ` [edit interfaces interface-range vlan-private1-c-codfw] member ge-1/0/2 { ... } + member ge-3/0/9; [edit interfaces interface-range disabled] -... 
[16:25:00] RECOVERY - puppet last run on backup2001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:25:08] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install kubernetes2017.codfw.wmnet - https://phabricator.wikimedia.org/T258745 (10Papaul) [16:29:00] RECOVERY - MD RAID on backup2001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [16:34:28] PROBLEM - icinga.wikimedia.org requires authentication on icinga1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:34:31] paged 👋 [16:34:48] and, can confirm, icinga.wm.o doesn't load for me [16:34:58] I'm here too [16:34:58] <_joe_> yes for me neither [16:35:16] PROBLEM - icinga-extmon.wikimedia.org requires authentication on icinga1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Monitoring/https_unauthorized [16:35:20] peeking in [16:35:41] o/ [16:36:03] looks like the host itself is up [16:36:24] RECOVERY - configured eth on backup2001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [16:36:58] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [16:38:01] (03PS1) 10Ppchelko: Remove unnesessary Content-Type header for fluent-bit [deployment-charts] - 10https://gerrit.wikimedia.org/r/621551 (https://phabricator.wikimedia.org/T260626) [16:38:16] still not sure what's up, found anything ? 
[16:38:46] (03CR) 10Ppchelko: [C: 03+2] Remove unnesessary Content-Type header for fluent-bit [deployment-charts] - 10https://gerrit.wikimedia.org/r/621551 (https://phabricator.wikimedia.org/T260626) (owner: 10Ppchelko) [16:38:54] no, not available using curl from icinga1001 either [16:39:25] <_joe_> what is listening on port 443? [16:39:30] apache [16:39:52] RECOVERY - Check whether ferm is active by checking the default input chain on backup2001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:39:54] <_joe_> so I think apache is having issues [16:39:58] (03Merged) 10jenkins-bot: Remove unnesessary Content-Type header for fluent-bit [deployment-charts] - 10https://gerrit.wikimedia.org/r/621551 (https://phabricator.wikimedia.org/T260626) (owner: 10Ppchelko) [16:40:03] <_joe_> it's not completing the tls negotiation [16:40:20] <_joe_> [Thu Aug 20 16:30:22.153475 2020] [mpm_prefork:error] [pid 187664] AH00161: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting [16:40:32] PROBLEM - HTTPS on icinga1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Icinga [16:40:40] <_joe_> !log restarted apache2 on icinga1001 [16:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:02] RECOVERY - icinga-extmon.wikimedia.org requires authentication on icinga1001 is OK: HTTP OK: Status line output matched HTTP/1.1 403 - 437 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Monitoring/https_unauthorized [16:41:12] <_joe_> wfm now [16:41:41] yes same [16:41:52] <_joe_> so maybe it will fail again [16:42:05] yes, I'm in [16:42:10] RECOVERY - HTTPS on icinga1001 is OK: SSL OK - Certificate icinga.wikimedia.org valid until 2020-11-03 17:19:04 +0000 (expires in 75 days) https://wikitech.wikimedia.org/wiki/Icinga [16:42:14] RECOVERY - icinga.wikimedia.org requires
authentication on icinga1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 596 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:43:13] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [16:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:37] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [16:45:38] RECOVERY - dhclient process on backup2001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [16:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:30] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [16:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:06] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
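Editorial note on the icinga1001 outage above: the line _joe_ pasted at 16:40:20 is a standard Apache error-log entry carrying code AH00161 (worker pool exhausted). A minimal sketch of how such lines could be spotted in a log follows. The regular expression and the helper name `is_worker_exhaustion` are illustrative assumptions, not tooling used in the channel, and the pattern only covers the prefork-style `[pid N]` form quoted in the log.

```python
import re

# Assumed shape, taken from the line quoted in the log:
# [timestamp] [module:level] [pid N] AHnnnnn: message
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] "
    r"\[(?P<module>[^:\]]+):(?P<level>[^\]]+)\] "
    r"\[pid (?P<pid>\d+)\] "
    r"(?P<code>AH\d{5}): (?P<message>.*)$"
)


def is_worker_exhaustion(line):
    """Return True if the line reports AH00161 (MaxRequestWorkers reached)."""
    match = LOG_LINE.match(line.strip())
    return bool(match) and match.group("code") == "AH00161"


# The exact entry pasted in the channel at 16:40:20.
sample = (
    "[Thu Aug 20 16:30:22.153475 2020] [mpm_prefork:error] [pid 187664] "
    "AH00161: server reached MaxRequestWorkers setting, "
    "consider raising the MaxRequestWorkers setting"
)
print(is_worker_exhaustion(sample))
```

A check like this only flags the symptom. As the conversation notes, the restart cleared it without identifying why the workers filled up.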
[16:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:07] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:15] dont see anything obvious going back to dinner [16:51:43] ACKNOWLEDGEMENT - MegaRAID on backup2001 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T260927 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:51:47] 10Operations, 10ops-codfw: Degraded RAID on backup2001 - https://phabricator.wikimedia.org/T260927 (10ops-monitoring-bot) [16:53:07] 10Operations, 10ops-codfw, 10DC-Ops: (Need By:2020-08-17) label/setup/install pki2001 - https://phabricator.wikimedia.org/T259825 (10Papaul) [16:54:32] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [16:56:25] (03PS2) 10Bstorm: cumin: for new wmcs. prefix for cookbooks, grant access to wmcs-admins [puppet] - 10https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) [16:57:38] (03CR) 10Bstorm: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [16:57:59] (03PS3) 10Bstorm: cumin: for new wmcs. prefix for cookbooks, grant access to wmcs-admins [puppet] - 10https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) [17:00:04] halfak and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200820T1700). 
[17:13:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:13:50] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:14:48] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:15:08] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:22:19] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [17:22:19] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=99) [17:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:13] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [17:23:14] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.network.prepare-upgrade (exit_code=97) [17:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:17] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [17:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:24] (03CR) 10Bstorm: [C: 03+2] wikireplicas: add wikireplica cookbook to add a wiki [cookbooks] - 10https://gerrit.wikimedia.org/r/621088 
(https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [17:24:28] (03Merged) 10jenkins-bot: wikireplicas: add wikireplica cookbook to add a wiki [cookbooks] - 10https://gerrit.wikimedia.org/r/621088 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [17:28:54] !log restart elasticsearch on logstash1020 -- high gc runtimes [17:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:40] !log restart elasticsearch on logstash1012 (not 1020) -- high gc runtimes [17:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:57] (03CR) 10Dzahn: "Bug: T257906" [puppet] - 10https://gerrit.wikimedia.org/r/621338 (owner: 10Dzahn) [17:39:15] (03PS1) 10Dzahn: testreduce: add parameters to control client services [puppet] - 10https://gerrit.wikimedia.org/r/621555 (https://phabricator.wikimedia.org/T257906) [17:40:19] (03CR) 10jerkins-bot: [V: 04-1] testreduce: add parameters to control client services [puppet] - 10https://gerrit.wikimedia.org/r/621555 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [17:53:17] (03PS2) 10Dzahn: testreduce: add parameters to control client services [puppet] - 10https://gerrit.wikimedia.org/r/621555 (https://phabricator.wikimedia.org/T257906) [17:53:52] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [18:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200820T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:08:05] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/24596/" [puppet] - 10https://gerrit.wikimedia.org/r/621555 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [18:16:04] PROBLEM - Check systemd state on ores1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:19] checking ores [18:18:02] !log testreduce1001 - rt_client and vd_client now properly stopped by puppet T257906 [18:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:10] T257906: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 [18:19:08] !log ores1004 - starting failed celery-ores-worker [18:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:58] RECOVERY - Check systemd state on ores1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:37:08] (03PS1) 10Hashar: ci: bring back jenkins-deploy in the docker group [puppet] - 10https://gerrit.wikimedia.org/r/621563 (https://phabricator.wikimedia.org/T260930) [18:37:21] (03CR) 10CDanis: [C: 03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/621251 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [18:37:38] (03CR) 10Hashar: profile::ci::docker: manage all group membership in data module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572707 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [18:38:15] (03CR) 10jerkins-bot: [V: 04-1] ci: bring back jenkins-deploy in the docker group [puppet] - 10https://gerrit.wikimedia.org/r/621563 (https://phabricator.wikimedia.org/T260930) (owner: 10Hashar) [18:39:59] (03PS1) 10Dzahn: service::catalog: switch ORES to encryption: true [puppet] - 10https://gerrit.wikimedia.org/r/621564 [18:40:11] (03PS2) 10Hashar: ci: bring back jenkins-deploy in the docker group [puppet] - 10https://gerrit.wikimedia.org/r/621563 (https://phabricator.wikimedia.org/T260930) [18:42:46] !log bstorm@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [18:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:53] !log bstorm@cumin1001 END (FAIL) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=99) [18:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:13] (03CR) 10Hashar: [C: 03+1] "That restores the behavior before https://gerrit.wikimedia.org/r/c/operations/puppet/+/572707" [puppet] - 10https://gerrit.wikimedia.org/r/621563 (https://phabricator.wikimedia.org/T260930) (owner: 10Hashar) [18:45:43] (03CR) 10Dzahn: [C: 03+1] "realm checks aren't ideal but I wouldn't know of a better way to fix this in "labs" so suggested this one" [puppet] - 10https://gerrit.wikimedia.org/r/621563 (https://phabricator.wikimedia.org/T260930) (owner: 10Hashar) [18:47:51] (03PS1) 10Andrew Bogott: ceph backups: exclude a few more VMs from future backups [puppet] - 10https://gerrit.wikimedia.org/r/621567 [18:47:54] (03PS1) 10Andrew Bogott: wmcs-backup-instances.py: fix cleanup of old snaps [puppet] - 10https://gerrit.wikimedia.org/r/621568 [18:49:45] (03CR) 10Andrew 
Bogott: [C: 03+2] ceph backups: exclude a few more VMs from future backups [puppet] - 10https://gerrit.wikimedia.org/r/621567 (owner: 10Andrew Bogott) [18:50:08] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-backup-instances.py: fix cleanup of old snaps [puppet] - 10https://gerrit.wikimedia.org/r/621568 (owner: 10Andrew Bogott) [18:51:07] (03CR) 10CDanis: [C: 03+2] "Confirmed that this fixes the test added in I8493542" [puppet] - 10https://gerrit.wikimedia.org/r/621342 (https://phabricator.wikimedia.org/T260520) (owner: 10Dzahn) [18:51:14] (03CR) 10CDanis: [C: 03+2] varnish: add wikilovesmonuments to tests for maps access [puppet] - 10https://gerrit.wikimedia.org/r/621347 (https://phabricator.wikimedia.org/T260520) (owner: 10Dzahn) [18:56:19] 🤝 thanks cdanis [18:59:17] 👍 [19:00:05] twentyafterfour and marxarelli: May I have your attention please! Mediawiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200820T1900) [19:00:30] (03CR) 10Dzahn: [C: 03+2] ci: switch integration.wikimedia.org to scap DocumentRoot [puppet] - 10https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [19:01:15] o/ [19:01:43] twentyafterfour: heyo o/ [19:02:40] !log 1.36.0-wmf.5 has no known blockers and logspam is cleaned up, time to roll group2 wikis to wmf.5 [19:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:11] (03PS1) 10Bstorm: wikireplicas: fix typo in the dns script for wikireplicas [cookbooks] - 10https://gerrit.wikimedia.org/r/621574 (https://phabricator.wikimedia.org/T260389) [19:07:22] !log switching document root of integration.wikimedia.org to scap (T149924) [19:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:26] T149924: Clear /srv/.git on contint1001; move integration.wikimedia.org docroot to new location - https://phabricator.wikimedia.org/T149924 [19:08:49] (03PS1) 10Ppchelko: api-gateway: JSON error responses 
and empty path redirect [deployment-charts] - 10https://gerrit.wikimedia.org/r/621575 (https://phabricator.wikimedia.org/T260795) [19:08:58] !log restarted apache on cont2001 for integration.wikimedia.org docroot change [19:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:49] (03PS2) 10Dzahn: service::catalog: switch ORES to encryption: true [puppet] - 10https://gerrit.wikimedia.org/r/621564 [19:11:27] (03CR) 10Bstorm: [C: 03+2] wikireplicas: fix typo in the dns script for wikireplicas [cookbooks] - 10https://gerrit.wikimedia.org/r/621574 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [19:12:32] (03Merged) 10jenkins-bot: wikireplicas: fix typo in the dns script for wikireplicas [cookbooks] - 10https://gerrit.wikimedia.org/r/621574 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [19:13:06] uhm, Deferred update 'AtomicSectionUpdate_MovePage::moveUnsafe' failed to run. [19:15:01] (03PS1) 1020after4: all wikis to 1.36.0-wmf.5 refs T257973 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621576 [19:15:03] (03CR) 1020after4: [C: 03+2] all wikis to 1.36.0-wmf.5 refs T257973 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621576 (owner: 1020after4) [19:15:42] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.5 refs T257973 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621576 (owner: 1020after4) [19:17:21] (03PS1) 10CRusnov: netbox: Update to v2.8.9-wmf [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/621577 [19:17:42] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.5 refs T257973 [19:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:46] T257973: 1.36.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T257973 [19:17:46] !log bstorm@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [19:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:17:48] !log bstorm@cumin1001 END (FAIL) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=99) [19:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:10] yay [19:19:41] twentyafterfour: wait, is that a sarcastic yay or earnest? [19:19:56] lgtm is why i ask :) just want to make sure [19:20:10] for once it was non-sarcastic [19:20:15] haha [19:20:25] sarcasm is my usual for sure [19:20:27] then yay (same) [19:20:48] yeah this has been the easiest train in ...ever [19:20:49] (03PS2) 10Ppchelko: api-gateway: JSON error responses and empty path redirect [deployment-charts] - 10https://gerrit.wikimedia.org/r/621575 (https://phabricator.wikimedia.org/T260795) [19:21:01] * marxarelli knocks wood [19:21:05] lol [19:21:23] I would jinx it like that [19:21:35] (03CR) 10Ppchelko: [C: 03+2] api-gateway: JSON error responses and empty path redirect [deployment-charts] - 10https://gerrit.wikimedia.org/r/621575 (https://phabricator.wikimedia.org/T260795) (owner: 10Ppchelko) [19:22:39] (03Merged) 10jenkins-bot: api-gateway: JSON error responses and empty path redirect [deployment-charts] - 10https://gerrit.wikimedia.org/r/621575 (https://phabricator.wikimedia.org/T260795) (owner: 10Ppchelko) [19:24:19] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [19:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:16] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [19:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:38] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [19:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:47] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[19:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:41] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (10jijiki) a:05RLazarus→03None [19:39:57] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (10jijiki) a:03jijiki [19:47:47] 10Operations, 10Maps, 10Traffic, 10Wiki-Loves-Monuments (2020): maps.wikilovesmonuments.org returns a HTTP 429 error (let it access varnish maps_domains) - https://phabricator.wikimedia.org/T260520 (10Dzahn) @Zache You may try again now. Is the 429 gone? [19:51:27] 10Operations, 10Maps, 10Traffic, 10Wiki-Loves-Monuments (2020): maps.wikilovesmonuments.org returns a HTTP 429 error (let it access varnish maps_domains) - https://phabricator.wikimedia.org/T260520 (10Dzahn) 05Open→03Resolved p:05Triage→03High a:03Dzahn Thanks to @cdanis for deploying my change.... 
[19:53:23] 10Operations, 10Fundraising-Backlog, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), 10Patch-For-Review, and 3 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10nskaggs) [19:57:10] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.network.prepare-upgrade (exit_code=97) [19:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:33] (03PS1) 10Ppchelko: Properly set x-internal-host for domains with no lang [deployment-charts] - 10https://gerrit.wikimedia.org/r/621581 (https://phabricator.wikimedia.org/T235276) [19:59:05] (03CR) 10Ppchelko: [C: 03+2] "Obvious bug" [deployment-charts] - 10https://gerrit.wikimedia.org/r/621581 (https://phabricator.wikimedia.org/T235276) (owner: 10Ppchelko) [20:00:12] (03Merged) 10jenkins-bot: Properly set x-internal-host for domains with no lang [deployment-charts] - 10https://gerrit.wikimedia.org/r/621581 (https://phabricator.wikimedia.org/T235276) (owner: 10Ppchelko) [20:01:17] (03PS1) 10CDanis: depool eqsin for router upgrade [dns] - 10https://gerrit.wikimedia.org/r/621583 (https://phabricator.wikimedia.org/T259621) [20:01:46] (03CR) 10CDanis: [C: 03+2] depool eqsin for router upgrade [dns] - 10https://gerrit.wikimedia.org/r/621583 (https://phabricator.wikimedia.org/T259621) (owner: 10CDanis) [20:02:00] !log depool eqsin for router upgrade [20:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:46] (03CR) 10Dzahn: varnish: add wikilovesmonuments to tests for maps access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621347 (https://phabricator.wikimedia.org/T260520) (owner: 10Dzahn) [20:07:57] (03CR) 10Dzahn: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/620729 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [20:09:17] (03CR) 10Dave Pifke: "Sorry, should have posted a link to this here: 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/621095" [puppet] - 10https://gerrit.wikimedia.org/r/620729 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [20:11:19] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [20:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:48] !log cdanis@cr2-eqsin> request vmhost software add /var/tmp/junos-vmhost-install-mx-x86-64-18.2R3-S5.3.tgz [20:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:36] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [20:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:38] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [20:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:55] !log cdanis@cr2-eqsin> request vmhost reboot [20:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:02] (03PS7) 10Jeena Huneidi: [WIP] Script to update image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/619833 (https://phabricator.wikimedia.org/T255835) [20:29:44] (03PS8) 10Jeena Huneidi: Script to update image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/619833 (https://phabricator.wikimedia.org/T255835) [20:29:54] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 76, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:30:05] (03PS1) 10Ppchelko: Api-gateway: fix pathing map [deployment-charts] - 10https://gerrit.wikimedia.org/r/621585 (https://phabricator.wikimedia.org/T235276) [20:30:56] PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : 
OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:31:01] ^ expected [20:31:04] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:31:13] ok *that* is not expected, but, the site is depooled [20:31:14] (03CR) 10Ppchelko: [C: 03+2] Api-gateway: fix pathing map [deployment-charts] - 10https://gerrit.wikimedia.org/r/621585 (https://phabricator.wikimedia.org/T235276) (owner: 10Ppchelko) [20:31:28] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:31:46] (03PS2) 10Thcipriani: Fix deploy script [deployment-charts] - 10https://gerrit.wikimedia.org/r/619858 (https://phabricator.wikimedia.org/T259684) (owner: 10Jeena Huneidi) [20:32:18] (03Merged) 10jenkins-bot: Api-gateway: fix pathing map [deployment-charts] - 10https://gerrit.wikimedia.org/r/621585 (https://phabricator.wikimedia.org/T235276) (owner: 10Ppchelko) [20:33:20] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:33:26] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:33:48] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 82, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:34:00] (03CR) 10Thcipriani: [C: 03+2] Fix deploy script [deployment-charts] - 10https://gerrit.wikimedia.org/r/619858 (https://phabricator.wikimedia.org/T259684) (owner: 10Jeena Huneidi) [20:34:50] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' 
. [20:34:50] RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:13] (03Merged) 10jenkins-bot: Fix deploy script [deployment-charts] - 10https://gerrit.wikimedia.org/r/619858 (https://phabricator.wikimedia.org/T259684) (owner: 10Jeena Huneidi) [20:36:04] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [20:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:40] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [20:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:45] (03CR) 10Dzahn: "ah, cool. thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/620729 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [20:44:45] (03PS1) 10Ahmon Dancy: Updated some cross references in comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621589 [20:50:30] (03PS1) 10CDanis: Revert "depool eqsin for router upgrade" [dns] - 10https://gerrit.wikimedia.org/r/621495 (https://phabricator.wikimedia.org/T259621) [20:52:55] (03CR) 10CDanis: [C: 03+2] Revert "depool eqsin for router upgrade" [dns] - 10https://gerrit.wikimedia.org/r/621495 (https://phabricator.wikimedia.org/T259621) (owner: 10CDanis) [20:53:07] !log repool eqsin [20:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:59] 10Operations, 10Traffic: Don't set cookies for api.wikimedia.org at the caching layer - https://phabricator.wikimedia.org/T260943 (10eprodromou) [21:00:45] (03CR) 10Cwhite: [V: 03+2 C: 03+2] prometheus: update prometheus-es-exporter config test to enforce namespace [puppet] - 10https://gerrit.wikimedia.org/r/621318 (https://phabricator.wikimedia.org/T256418) (owner: 
10Cwhite) [21:04:24] (03CR) 10Cwhite: [C: 03+2] prometheus: cleanup count of all logs [puppet] - 10https://gerrit.wikimedia.org/r/621309 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [21:05:07] (03PS2) 10Cwhite: prometheus: use aggs to consolidate mediawiki logging metrics [puppet] - 10https://gerrit.wikimedia.org/r/621098 (https://phabricator.wikimedia.org/T256418) [21:06:01] (03CR) 10jerkins-bot: [V: 04-1] prometheus: use aggs to consolidate mediawiki logging metrics [puppet] - 10https://gerrit.wikimedia.org/r/621098 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [21:06:03] (03PS3) 10Cwhite: prometheus: use aggs to consolidate mediawiki logging metrics [puppet] - 10https://gerrit.wikimedia.org/r/621098 (https://phabricator.wikimedia.org/T256418) [21:07:35] (03CR) 10Cwhite: [C: 03+2] prometheus: use aggs to consolidate mediawiki logging metrics [puppet] - 10https://gerrit.wikimedia.org/r/621098 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [21:08:01] (03CR) 10Cwhite: [C: 03+2] prometheus: use aggs to consolidate mediawiki logging metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621098 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [21:18:52] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:22:21] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) [21:24:48] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:28:44] (03PS1) 10Cwhite: prometheus: add apache2 es-exporter config [puppet] - 10https://gerrit.wikimedia.org/r/621597 (https://phabricator.wikimedia.org/T256418) [21:34:38] (03PS1) 10Cwhite: logstash: clean up statsd_exporter configuration [puppet] - 10https://gerrit.wikimedia.org/r/621599 (https://phabricator.wikimedia.org/T256418) [21:37:44] (03CR) 10BryanDavis: [C: 03+1] hieradata: disable panel html sanitization for grafana-labs [puppet] - 10https://gerrit.wikimedia.org/r/621197 (https://phabricator.wikimedia.org/T259143) (owner: 10Filippo Giunchedi) [21:48:12] (03CR) 10Dave Pifke: [C: 03+1] base: remove tungsten from check-microcode.py [puppet] - 10https://gerrit.wikimedia.org/r/620131 (https://phabricator.wikimedia.org/T260395) (owner: 10Dzahn) [21:48:25] (03CR) 10Dave Pifke: [C: 03+1] remove tungsten from site, DHCP and partman [puppet] - 10https://gerrit.wikimedia.org/r/620129 (https://phabricator.wikimedia.org/T260395) (owner: 10Dzahn) [21:48:50] (03CR) 10Dave Pifke: [C: 03+1] delete role::xhgui::app [puppet] - 10https://gerrit.wikimedia.org/r/620130 (https://phabricator.wikimedia.org/T260395) (owner: 10Dzahn) [21:49:42] oh :) yay! [21:49:53] dpifke: really anytime now? very cool [21:52:00] Yup, per my last comment on T180761, I've archived the old MongoDB data and confirmed nothing has been written since Tuesday, so we're good to go. [21:52:01] T180761: Move XHGui from tungsten to xhgui-001 - https://phabricator.wikimedia.org/T180761 [21:59:19] dpifke: great! thanks [22:02:58] dpifke: where are the profiles dumped to btw? [22:03:32] or was it just temporary to check the latest entry [22:03:37] JSON & SQL files are in my homedir on xhgui1001. LMK if there's a better place to keep them longer-term. [22:04:06] Should only be needed if we later catch something terribly wrong with the migration script, which seems unlikely at this point. 
[22:04:19] dpifke: you don't need anything from /home/dpifke on tungsten? 1.8G ? [22:04:44] mutante: Has been copied to xhgui1001. [22:04:57] people1001 is a fairly generic place for that, it has bacula backups as well [22:05:12] mutante: do we have backups of bastion /home as well? [22:05:52] Krinkle: yes, /home on bastion hosts is in bacula [22:06:01] bastion might be better for larger files like this since people1001 is a VM [22:06:16] anyway, either is fine I suppose. [22:06:35] OK, I'll copy the final JSON dump there. We also now have backups of the data in MariaDB. [22:06:36] if it's actually for long-term storage of larger files then i would say dumps servers [22:06:56] right, that's for public data sets and such [22:08:04] mutante: people2001 is unused/standby, right? Or are they kept in sync continuously active-active? [22:08:06] is this public or private? [22:08:14] temp private backup just in case [22:08:49] But public in the sense that it doesn't contain anything sensitive that wouldn't be visible from the UI. [22:08:56] right [22:10:02] 10Operations, 10Performance-Team, 10observability, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Krinkle) [22:10:13] Krinkle: it's a mix. they are kept in sync automatically. there is only one "active" as in the source for syncing. only one is serving traffic as of right now but it might change [22:10:43] 10Operations, 10Performance-Team, 10observability, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Krinkle) [22:10:47] mutante: k [22:11:46] hmm. 
I think my slight preference would be to keep using xhgui* itself and i just add the /home dir there to backups [22:12:19] 10Operations, 10Performance-Team, 10observability, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Krinkle) [22:12:25] mutante: it's a one off dump in case we find any issue that would need the mongo data from tungsten. [22:12:59] the xhgui hosts shouldn't contain any state [22:13:03] don't really want to encourage using people2001. that will lead to questions later [22:13:09] how to sync in both directions [22:13:21] yeah no, that was confusing on my part, I was just wondering unrelatedly [22:13:28] ofc it would be on people100x [22:14:00] ok, then use people1002 [22:14:26] 🍾 osmium, hafnium, tungsten [22:14:30] The end of an era. [22:14:36] ( https://phabricator.wikimedia.org/T158837 ) [22:15:00] i should say "use peopleweb.discovery.wmnet" and you can't go wrong :) [22:16:19] ok, let's get to that last checkbox, cool [22:16:37] getting ready to run the decom cookbook and kill tungsten [22:16:38] (03PS9) 10Jeena Huneidi: Script to update image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/619833 (https://phabricator.wikimedia.org/T255835) [22:16:47] (03CR) 10jerkins-bot: [V: 04-1] Script to update image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/619833 (https://phabricator.wikimedia.org/T255835) (owner: 10Jeena Huneidi) [22:17:10] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [22:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:30] (03PS1) 10Effie Mouzeli: helmfile: add values for staging environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/621605 (https://phabricator.wikimedia.org/T256973) [22:18:38] Wiped bootloaders [22:18:40] there it goes [22:18:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [22:18:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:20] (03PS2) 10Effie Mouzeli: helmfile: add values for staging environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/621605 (https://phabricator.wikimedia.org/T256973) [22:20:00] !log permanently shut down tungsten.eqiad.wmnet T260395 T158837 T180761 T224549 [22:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:36] T158837: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 [22:20:36] T180761: Move XHGui from tungsten to xhgui-001 - https://phabricator.wikimedia.org/T180761 [22:20:37] T224549: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 [22:20:37] T260395: decom tungsten - https://phabricator.wikimedia.org/T260395 [22:21:08] (03PS2) 10Dzahn: remove tungsten from site, DHCP and partman [puppet] - 10https://gerrit.wikimedia.org/r/620129 (https://phabricator.wikimedia.org/T260395) [22:21:55] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Dzahn) [22:22:56] 10Operations, 10Performance-Team, 10observability, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Dzahn) [22:23:17] 10Operations, 10Performance-Team, 10observability, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Dzahn) p:05Low→03Medium [22:23:38] (03CR) 10Dzahn: [C: 03+2] remove tungsten from site, DHCP and partman [puppet] - 10https://gerrit.wikimedia.org/r/620129 (https://phabricator.wikimedia.org/T260395) (owner: 10Dzahn) [22:24:42] (03PS2) 10Dzahn: delete role::xhgui::app [puppet] - 10https://gerrit.wikimedia.org/r/620130 (https://phabricator.wikimedia.org/T260395) [22:28:42] (03PS3) 10Dzahn: delete role::xhgui::app [puppet] - 10https://gerrit.wikimedia.org/r/620130 (https://phabricator.wikimedia.org/T260395) 
[22:29:44] (03CR) 10Dzahn: [C: 03+2] delete role::xhgui::app [puppet] - 10https://gerrit.wikimedia.org/r/620130 (https://phabricator.wikimedia.org/T260395) (owner: 10Dzahn) [22:31:49] (03PS1) 10Dzahn: cumin: update xhgui alias to apply to new role name [puppet] - 10https://gerrit.wikimedia.org/r/621606 (https://phabricator.wikimedia.org/T260395) [22:33:11] (03CR) 10Dzahn: [C: 03+2] cumin: update xhgui alias to apply to new role name [puppet] - 10https://gerrit.wikimedia.org/r/621606 (https://phabricator.wikimedia.org/T260395) (owner: 10Dzahn) [22:33:58] (03PS2) 10Dzahn: base: remove tungsten from check-microcode.py [puppet] - 10https://gerrit.wikimedia.org/r/620131 (https://phabricator.wikimedia.org/T260395) [22:35:15] (03CR) 10Dzahn: [C: 03+2] base: remove tungsten from check-microcode.py [puppet] - 10https://gerrit.wikimedia.org/r/620131 (https://phabricator.wikimedia.org/T260395) (owner: 10Dzahn) [22:36:35] 10Operations, 10ops-codfw, 10DC-Ops: (Need By:2020-08-17) label/setup/install pki2001 - https://phabricator.wikimedia.org/T259825 (10Papaul) [22:36:54] (03PS1) 10Dzahn: hiera: remove hosts/tungsten.yaml [puppet] - 10https://gerrit.wikimedia.org/r/621608 (https://phabricator.wikimedia.org/T260395) [22:37:41] (03CR) 10Dzahn: [C: 03+2] hiera: remove hosts/tungsten.yaml [puppet] - 10https://gerrit.wikimedia.org/r/621608 (https://phabricator.wikimedia.org/T260395) (owner: 10Dzahn) [22:38:38] weird, this host did not even have mgmt names [22:39:03] oh.. no.. 
i am confused because DNS stuff partially moved to netbox [22:39:32] so i guess in theory the decom cookbook should have removed it but did not yet [22:40:37] (03PS1) 10Dzahn: decom tungsten.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/621609 (https://phabricator.wikimedia.org/T260395) [22:42:44] (03PS2) 10Dzahn: decom tungsten.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/621609 (https://phabricator.wikimedia.org/T260395) [22:43:03] (03PS1) 10Papaul: DNS: Add production DNS for pki2001 [dns] - 10https://gerrit.wikimedia.org/r/621610 [22:44:24] (03CR) 10Papaul: [C: 03+2] DNS: Add production DNS for pki2001 [dns] - 10https://gerrit.wikimedia.org/r/621610 (owner: 10Papaul) [22:45:48] (03CR) 10Dzahn: "is this not "rpki"? That's what shows up on https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions" [dns] - 10https://gerrit.wikimedia.org/r/621610 (owner: 10Papaul) [22:46:03] papaul: rpki != pki ? [22:46:32] or are they 2 different things.. then they are just easily mixed up i guess [22:46:41] only rpki is already on the wikitech page though [22:46:49] mutante: the task said pki [22:47:17] papaul: yea, but it also said " will update https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions if name is agreed " and it's rpki there [22:48:01] or this is unrelated to existing rpki machines and just sounds very similar [22:48:18] mutante: don't know [22:48:36] yea, i can't tell from the ticket what this is for either [22:49:08] looks at https://phabricator.wikimedia.org/T259117 now [22:50:12] (03PS3) 10Cwhite: profile: install and configure statsd_exporter and retarget statsv [puppet] - 10https://gerrit.wikimedia.org/r/615269 (https://phabricator.wikimedia.org/T180105) [22:50:40] papaul: ok, looks like it's different things. 
rpki = networking/traffic, pki = jbond but very limited detail there [22:50:49] (03PS4) 10Cwhite: profile: install and configure statsd_exporter and retarget statsv [puppet] - 10https://gerrit.wikimedia.org/r/615269 (https://phabricator.wikimedia.org/T180105) [22:50:50] the wikitech page just needs the update still [22:51:04] mutante: got you [22:51:05] (03PS5) 10Cwhite: profile: install and configure statsd_exporter and retarget statsv [puppet] - 10https://gerrit.wikimedia.org/r/615269 (https://phabricator.wikimedia.org/T180105) [22:52:12] (03PS6) 10Cwhite: profile: install and configure statsd_exporter and retarget statsv [puppet] - 10https://gerrit.wikimedia.org/r/615269 (https://phabricator.wikimedia.org/T180105) [22:53:37] i could see that not being the last time where rpki1001 and pki1001 are mixed [22:56:12] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By:2020-08-17) label/setup/install pki2001 - https://phabricator.wikimedia.org/T259825 (10Papaul) [22:56:51] (03PS3) 10Dzahn: decom tungsten.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/621609 (https://phabricator.wikimedia.org/T260395) [22:56:56] (03CR) 10Dzahn: [C: 03+2] decom tungsten.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/621609 (https://phabricator.wikimedia.org/T260395) (owner: 10Dzahn) [22:58:04] @seen paladox [22:58:04] mutante: paladox is in here, right now [22:59:09] Hi [23:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200820T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:01:35] paladox: :) PM [23:06:31] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By:2020-08-17) label/setup/install pki2001 - https://phabricator.wikimedia.org/T259825 (10Papaul) Please provide a specific partition/raid configuration for 4x4TB disks. 
Once I have the information i will proceed with the install. Thanks [23:11:06] (03PS1) 10Papaul: DHCP: Add MAC address for pki2001 [puppet] - 10https://gerrit.wikimedia.org/r/621612 (https://phabricator.wikimedia.org/T259825) [23:12:34] (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address for pki2001 [puppet] - 10https://gerrit.wikimedia.org/r/621612 (https://phabricator.wikimedia.org/T259825) (owner: 10Papaul) [23:13:07] (03PS2) 10Papaul: DHCP: Add MAC address for pki2001 [puppet] - 10https://gerrit.wikimedia.org/r/621612 (https://phabricator.wikimedia.org/T259825) [23:13:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10Performance-Team, 10serviceops: decom tungsten - https://phabricator.wikimedia.org/T260395 (10Dzahn) a:05Dzahn→03Cmjohnson [23:13:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10Performance-Team, 10serviceops: decom tungsten - https://phabricator.wikimedia.org/T260395 (10Dzahn) [23:17:21] 10Operations, 10serviceops, 10Patch-For-Review: decom releases1001 and releases2001 - https://phabricator.wikimedia.org/T260742 (10Dzahn) p:05Triage→03Medium [23:17:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Performance-Team (Radar): decom tungsten - https://phabricator.wikimedia.org/T260395 (10Dzahn) [23:18:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Performance-Team (Radar): decom tungsten - https://phabricator.wikimedia.org/T260395 (10Dzahn) a:05Cmjohnson→03None [23:18:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Performance-Team (Radar): decom tungsten - https://phabricator.wikimedia.org/T260395 (10Dzahn) I am not sure if I am supposed to directly assign to people now or keep just using the ops- tag as before. Looks like the words on the decom template got... [23:19:32] 10Operations, 10Performance-Team, 10observability, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Dzahn) From my side this ticket looks done now. 
[23:26:49] 10Operations, 10Cloud-VPS, 10User-fgiunchedi, 10cloud-services-team (Kanban): CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10Dzahn) [23:27:15] 10Operations, 10Cloud-VPS, 10User-fgiunchedi, 10cloud-services-team (Kanban): CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10Dzahn) tungsten has been decom'ed today. removed from the list [23:28:54] 10Operations, 10observability, 10Graphite, 10audits-data-retention: graphite-web logs are not rotated - https://phabricator.wikimedia.org/T86546 (10Dzahn) [23:30:14] 10Operations, 10observability, 10Graphite, 10audits-data-retention: graphite-web logs are not rotated - https://phabricator.wikimedia.org/T86546 (10Dzahn) This task is over 5 years old and i found it while searching for "tungsten" when i decom'ed it today. graphite-web has not been on that host in a long... [23:32:35] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install kubernetes2017.codfw.wmnet - https://phabricator.wikimedia.org/T258745 (10Papaul) [23:32:47] (03PS10) 10Jeena Huneidi: Script to update image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/619833 (https://phabricator.wikimedia.org/T255835) [23:34:35] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Audit /etc/apt directories - https://phabricator.wikimedia.org/T214605 (10Dzahn) This ticket actually sounds like it was completed but is still open. ? [23:37:57] (03PS1) 10Papaul: DNS: Add production DNS for kubernetes2017 [dns] - 10https://gerrit.wikimedia.org/r/621613 [23:39:40] (03CR) 10Papaul: [C: 03+2] DNS: Add production DNS for kubernetes2017 [dns] - 10https://gerrit.wikimedia.org/r/621613 (owner: 10Papaul) [23:50:43] (03CR) 10Dzahn: ""lulu present in privileged LDAP group (nda),but not present in data.yaml"" [puppet] - 10https://gerrit.wikimedia.org/r/621257 (owner: 10Jbond)