[00:42:03] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:51:27] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[01:30:39] Operations, serviceops, PHP 7.2 support: (euwiki) Mysterious, coordinated slowdowns every ~ 25 minutes on API servers - https://phabricator.wikimedia.org/T231011 (Aklapper) Anybody knows if this is still an issue nowadays?
[03:11:08] (PS2) Dzahn: Add our wiki blog to EN planet [puppet] - https://gerrit.wikimedia.org/r/560154 (owner: Jeroen De Dauw)
[03:11:52] (PS3) Dzahn: planet: Add professional.wiki to en feeds [puppet] - https://gerrit.wikimedia.org/r/560154 (owner: Jeroen De Dauw)
[03:11:59] (CR) Dzahn: [C: +2] planet: Add professional.wiki to en feeds [puppet] - https://gerrit.wikimedia.org/r/560154 (owner: Jeroen De Dauw)
[03:22:41] Operations, serviceops, PHP 7.2 support: (euwiki) Mysterious, coordinated slowdowns every ~ 25 minutes on API servers - https://phabricator.wikimedia.org/T231011 (Dzahn) It seems the pattern has stopped. I can still see it on Nov 30th: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-...
[03:33:56] Operations: Audit the WMF LDAP group and limit its permissions - https://phabricator.wikimedia.org/T240870 (MZMcBride) Related: {T62412}. This group has been problematic for a while.
[05:12:41] Operations, Traffic: Add more detailed instructions to the "sec-advice" page - https://phabricator.wikimedia.org/T241309 (Dzahn) The content of the sec-warning page can be found in the repository: [[ https://gerrit.wikimedia.org/r/q/operations%252Fpuppet | operations/puppet ]] at `~/puppet/modules/varnis...
[05:15:39] (PS1) CRusnov: netbox: Add hiera entries for ESAMS and ULSFO ganeti sync [puppet] - https://gerrit.wikimedia.org/r/560359 (https://phabricator.wikimedia.org/T239123)
[05:18:53] Operations: Audit the WMF LDAP group and limit its permissions - https://phabricator.wikimedia.org/T240870 (Dzahn) Wouldn't it be better to create a new group that actually means "wmf employee" and knowingly give that the permissions we agree it should have, rather than trying to remove stuff from the existi...
[05:21:55] Its like the on-wiki staff group, which is also silly
[05:22:22] (PS3) Dzahn: Revert "Remove access for bmansurov" [puppet] - https://gerrit.wikimedia.org/r/559651 (https://phabricator.wikimedia.org/T241089)
[05:23:27] (CR) jerkins-bot: [V: -1] Revert "Remove access for bmansurov" [puppet] - https://gerrit.wikimedia.org/r/559651 (https://phabricator.wikimedia.org/T241089) (owner: Dzahn)
[05:28:28] (PS1) Andrew Bogott: Openstack keystone (ocata/stretch): pull in python-ldap from the buster repo [puppet] - https://gerrit.wikimedia.org/r/560360 (https://phabricator.wikimedia.org/T229227)
[05:29:06] (CR) CRusnov: [C: +2] "self merge - trivial change and also tested change manually." [puppet] - https://gerrit.wikimedia.org/r/560359 (https://phabricator.wikimedia.org/T239123) (owner: CRusnov)
[05:29:10] (CR) jerkins-bot: [V: -1] Openstack keystone (ocata/stretch): pull in python-ldap from the buster repo [puppet] - https://gerrit.wikimedia.org/r/560360 (https://phabricator.wikimedia.org/T229227) (owner: Andrew Bogott)
[05:32:42] (PS4) Dzahn: Revert "Remove access for bmansurov" [puppet] - https://gerrit.wikimedia.org/r/559651 (https://phabricator.wikimedia.org/T241089)
[05:34:10] (CR) Dzahn: "manually rebased" [puppet] - https://gerrit.wikimedia.org/r/559651 (https://phabricator.wikimedia.org/T241089) (owner: Dzahn)
[05:34:14] (PS2) Andrew Bogott: Openstack keystone (ocata/stretch): pull in python-ldap from the buster repo [puppet] - https://gerrit.wikimedia.org/r/560360 (https://phabricator.wikimedia.org/T229227)
[05:34:37] (CR) jerkins-bot: [V: -1] Revert "Remove access for bmansurov" [puppet] - https://gerrit.wikimedia.org/r/559651 (https://phabricator.wikimedia.org/T241089) (owner: Dzahn)
[05:36:25] (PS3) Andrew Bogott: Openstack keystone (ocata/stretch): pull in python-ldap from the buster repo [puppet] - https://gerrit.wikimedia.org/r/560360 (https://phabricator.wikimedia.org/T229227)
[05:38:09] Operations: Track services without a native systemd unit - https://phabricator.wikimedia.org/T240843 (Dzahn) affected and installed on (almost) everything: - exim4 - acct
[05:38:17] (PS4) Andrew Bogott: Openstack keystone (ocata/stretch): pull in python-ldap from the buster repo [puppet] - https://gerrit.wikimedia.org/r/560360 (https://phabricator.wikimedia.org/T229227)
[05:47:37] Operations, Beta-Cluster-Infrastructure, Release-Engineering-Team, Scap, serviceops: On beta, scap can't clear opcache on some mw servers - https://phabricator.wikimedia.org/T237033 (Dzahn) Compare the Hiera settings for the affected hosts to `hieradata/labs/deployment-prep/host/deployment-me...
[05:53:37] (PS1) Dzahn: xhgui: use ensure=>present instead of ensure=>latest [puppet] - https://gerrit.wikimedia.org/r/560364 (https://phabricator.wikimedia.org/T218900)
[05:55:31] Operations, Beta-Cluster-Infrastructure, Release-Engineering-Team, Scap, serviceops: On beta, scap can't clear opcache on some mw servers - https://phabricator.wikimedia.org/T237033 (Dzahn) Also T236275#5621970 might be related.
[05:59:00] (PS1) Dzahn: install_server: switch contint2001 from jessie to buster [puppet] - https://gerrit.wikimedia.org/r/560365 (https://phabricator.wikimedia.org/T224591)
[06:03:30] (PS5) Dzahn: Revert "Remove access for bmansurov" [puppet] - https://gerrit.wikimedia.org/r/559651 (https://phabricator.wikimedia.org/T241089)
[06:05:20] (CR) jerkins-bot: [V: -1] Revert "Remove access for bmansurov" [puppet] - https://gerrit.wikimedia.org/r/559651 (https://phabricator.wikimedia.org/T241089) (owner: Dzahn)
[06:08:32] (PS6) Dzahn: Revert "Remove access for bmansurov" [puppet] - https://gerrit.wikimedia.org/r/559651 (https://phabricator.wikimedia.org/T241089)
[06:24:50] (PS5) Andrew Bogott: Openstack keystone (ocata/stretch): pull in python-ldap v3 [puppet] - https://gerrit.wikimedia.org/r/560360 (https://phabricator.wikimedia.org/T229227)
[06:24:52] (PS1) Andrew Bogott: Keystone hooks: monkeypatch ldap.initialize to set bytes_mode=false [puppet] - https://gerrit.wikimedia.org/r/560366 (https://phabricator.wikimedia.org/T229227)
[06:25:48] (CR) jerkins-bot: [V: -1] Keystone hooks: monkeypatch ldap.initialize to set bytes_mode=false [puppet] - https://gerrit.wikimedia.org/r/560366 (https://phabricator.wikimedia.org/T229227) (owner: Andrew Bogott)
[06:52:16] (PS1) CRusnov: tools/import-mgmt-dns.py: Add dry-run mode [software/netbox-extras] - https://gerrit.wikimedia.org/r/560367
[06:53:44] (PS2) CRusnov: tools/import-mgmt-dns.py: Add dry-run mode [software/netbox-extras] - https://gerrit.wikimedia.org/r/560367
[07:00:27] (PS3) CRusnov: tools/import-mgmt-dns.py: Add dry-run mode [software/netbox-extras] - https://gerrit.wikimedia.org/r/560367
[07:38:20] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:41:56] (PS7) Ammarpad: Add minerva custom logo for la.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728)
[07:42:06] (PS8) Ammarpad: Add minerva custom logo for la.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728)
[07:48:26] (CR) Ammarpad: [C: +1] "Looks good." [mediawiki-config] - https://gerrit.wikimedia.org/r/560339 (https://phabricator.wikimedia.org/T241329) (owner: MarcoAurelio)
[07:52:04] (CR) Muehlenhoff: thorium/eventlog: Switch to standard recipes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/559827 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[07:53:54] (PS2) Muehlenhoff: eventlog: Switch to standard recipe [puppet] - https://gerrit.wikimedia.org/r/559827 (https://phabricator.wikimedia.org/T156955)
[07:58:48] (CR) Muehlenhoff: [C: +1] "Looks good, one nit." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/559879 (owner: Herron)
[08:00:03] (PS2) Giuseppe Lavagetto: Add class to scan a registry for images [docker-images/docker-report] - https://gerrit.wikimedia.org/r/559804 (https://phabricator.wikimedia.org/T241206)
[08:00:05] (PS2) Giuseppe Lavagetto: Add a registry reporter [docker-images/docker-report] - https://gerrit.wikimedia.org/r/559933
[08:01:22] (CR) Giuseppe Lavagetto: "This change is ready for review." [docker-images/docker-report] - https://gerrit.wikimedia.org/r/559933 (owner: Giuseppe Lavagetto)
[08:05:07] (CR) Muehlenhoff: [C: +1] install_server: switch contint2001 from jessie to buster [puppet] - https://gerrit.wikimedia.org/r/560365 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[08:05:38] Operations, serviceops, PHP 7.2 support: (euwiki) Mysterious, coordinated slowdowns every ~ 25 minutes on API servers - https://phabricator.wikimedia.org/T231011 (Joe) Open→Resolved a:Joe Yes, this is yet another WTF resolved by killing parsoid-js.
[08:08:08] (PS3) Giuseppe Lavagetto: Add a registry reporter [docker-images/docker-report] - https://gerrit.wikimedia.org/r/559933
[08:10:32] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:30] !log installing cyrus-sasl2 security updates
[08:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:46] (CR) DannyS712: [C: +1] "Looks fine" [mediawiki-config] - https://gerrit.wikimedia.org/r/560339 (https://phabricator.wikimedia.org/T241329) (owner: MarcoAurelio)
[08:31:14] !log cp2023: depool ats-be for Lua path normalization experiment T241232
[08:31:17] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2023.codfw.wmnet,service=ats-be
[08:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:21] T241232: High CPU usage for ats-be ET_NET thread handling PURGE requests on cache_text - https://phabricator.wikimedia.org/T241232
[08:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:44] (PS1) Muehlenhoff: Add library hint for cyrus-sasl2 [puppet] - https://gerrit.wikimedia.org/r/560371
[08:39:29] (CR) Muehlenhoff: [C: +2] Add library hint for cyrus-sasl2 [puppet] - https://gerrit.wikimedia.org/r/560371 (owner: Muehlenhoff)
[08:43:47] (PS1) Andrew Bogott: Add initial config for Openstack Pike [puppet] - https://gerrit.wikimedia.org/r/560372 (https://phabricator.wikimedia.org/T241347)
[08:43:49] (PS1) Andrew Bogott: Openstack Designate: add manifests for Openstack Pike [puppet] - https://gerrit.wikimedia.org/r/560373 (https://phabricator.wikimedia.org/T241348)
[08:44:30] (CR) jerkins-bot: [V: -1] Add initial config for Openstack Pike [puppet] - https://gerrit.wikimedia.org/r/560372 (https://phabricator.wikimedia.org/T241347) (owner: Andrew Bogott)
[08:48:20] (PS2) Andrew Bogott: Add initial config for Openstack Pike [puppet] - https://gerrit.wikimedia.org/r/560372 (https://phabricator.wikimedia.org/T241347)
[08:48:22] (PS2) Andrew Bogott: Openstack Designate: add manifests for Openstack Pike [puppet] - https://gerrit.wikimedia.org/r/560373 (https://phabricator.wikimedia.org/T241348)
[08:49:32] Operations, DC-Ops, decommission, Discovery-Search (Current work): decommission elastic10[18-31].eqiad.wmnet - https://phabricator.wikimedia.org/T239821 (MoritzMuehlenhoff) I noticed that elastic1019 is still in puppetdb*, maybe the decom cookbook failed there? ` jmm@cumin2001:~$ sudo cumin ela...
[08:50:28] !log restarting slapd on LDAP replicas to pick up SASL security update
[08:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:03] (PS6) Andrew Bogott: Openstack keystone (ocata/stretch): pull in python-ldap v3 [puppet] - https://gerrit.wikimedia.org/r/560360 (https://phabricator.wikimedia.org/T229227)
[08:54:05] (PS2) Andrew Bogott: Keystone hooks: monkeypatch ldap.initialize to set bytes_mode=false [puppet] - https://gerrit.wikimedia.org/r/560366 (https://phabricator.wikimedia.org/T229227)
[08:55:01] (CR) jerkins-bot: [V: -1] Keystone hooks: monkeypatch ldap.initialize to set bytes_mode=false [puppet] - https://gerrit.wikimedia.org/r/560366 (https://phabricator.wikimedia.org/T229227) (owner: Andrew Bogott)
[08:58:49] !log restarting slapd on serpens/seaborgium to pick up SASL security update
[08:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:34] (PS1) Andrew Bogott: keystone/pike: remove obsolete filter from paste.ini [puppet] - https://gerrit.wikimedia.org/r/560375 (https://phabricator.wikimedia.org/T241347)
[09:02:36] (PS1) Andrew Bogott: nova/pike: update policy.json for new Pike policy changes [puppet] - https://gerrit.wikimedia.org/r/560376 (https://phabricator.wikimedia.org/T241347)
[09:03:49] (PS3) Andrew Bogott: Keystone hooks: monkeypatch ldap.initialize to set bytes_mode=false [puppet] - https://gerrit.wikimedia.org/r/560366 (https://phabricator.wikimedia.org/T229227)
[09:08:04] Operations, Traffic: High CPU usage for ats-be ET_NET thread handling PURGE requests on cache_text - https://phabricator.wikimedia.org/T241232 (ema) With the migration to ATS we have moved the [[ https://wikitech.wikimedia.org/wiki/URI_Path_Normalization | URI Path Normalization ]] implementation from na...
[09:10:38] !log cp2023: wipe ats-be cache and repool after normalize-path.lua experiment T241232
[09:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:45] T241232: High CPU usage for ats-be ET_NET thread handling PURGE requests on cache_text - https://phabricator.wikimedia.org/T241232
[09:11:11] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2023.codfw.wmnet,service=ats-be
[09:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:42] (CR) Muehlenhoff: Openstack keystone (ocata/stretch): pull in python-ldap v3 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/560360 (https://phabricator.wikimedia.org/T229227) (owner: Andrew Bogott)
[09:17:17] Operations, ops-codfw, DBA: codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (jcrespo) @Papaul sorry he will be away, I don't have a clear solution, but I believe any other place on the same row, but not the same rack as the others will do. Please note th...
[09:17:56] <_joe_> !log running docker-report for base images, as a test, on boron
[09:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:35] Operations, ops-codfw, DBA: codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (jcrespo)
[09:21:53] Operations, ops-codfw, DBA: codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (jcrespo)
[09:23:06] (CR) Elukey: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1001/20116/an-airflow1001.eqiad.wmnet/" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/558687 (https://phabricator.wikimedia.org/T236180) (owner: EBernhardson)
[09:23:38] (CR) Elukey: "@Moritz: do we need to go through another SRE approval for the user change in your opinion?" [puppet] - https://gerrit.wikimedia.org/r/558687 (https://phabricator.wikimedia.org/T236180) (owner: EBernhardson)
[09:28:09] Operations, ops-eqiad: eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (jcrespo)
[09:30:28] (CR) Arturo Borrero Gonzalez: [C: +2] "thanks for working on this!" [puppet] - https://gerrit.wikimedia.org/r/560323 (https://phabricator.wikimedia.org/T241310) (owner: BryanDavis)
[09:30:36] (CR) Elukey: [C: +1] eventlog: Switch to standard recipe [puppet] - https://gerrit.wikimedia.org/r/559827 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[09:31:19] Operations, ops-eqiad, DBA: eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (jcrespo) Task has been created, @Marostegui will take care of creating the racking strategy on January, but if for some reason you are in a hurry (we are not), it will be distrib...
[09:33:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:35:16] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:37:42] Operations, DC-Ops, decommission, Discovery-Search (Current work): decommission elastic10[18-31].eqiad.wmnet - https://phabricator.wikimedia.org/T239821 (Volans) @MoritzMuehlenhoff mmmh, according to T239821#5747654 it all worked fine. LMK if I should investigate.
[09:41:15] Operations, Machine vision, Product-Infrastructure-Team-Backlog, Structured-Data-Backlog, and 5 others: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (jcrespo) To try to close this as resolved, there it seems to be an incident report already...
[09:45:29] Operations, DC-Ops, decommission, Discovery-Search (Current work): decommission elastic10[18-31].eqiad.wmnet - https://phabricator.wikimedia.org/T239821 (MoritzMuehlenhoff) I had a quick look, it's the rare race we've seen before: The cookbook executed the "puppet node deactivate" at 13:51:32 and...
[09:47:39] (PS7) Arturo Borrero Gonzalez: Openstack keystone (ocata/stretch): pull in python-ldap v3 [puppet] - https://gerrit.wikimedia.org/r/560360 (https://phabricator.wikimedia.org/T229227) (owner: Andrew Bogott)
[09:47:59] (PS1) Elukey: admin: Backfill kerberos settings to data.yaml [puppet] - https://gerrit.wikimedia.org/r/560378 (https://phabricator.wikimedia.org/T237605)
[09:49:42] Operations, DC-Ops, decommission, Discovery-Search (Current work): decommission elastic10[18-31].eqiad.wmnet - https://phabricator.wikimedia.org/T239821 (Volans) Interesting, given that the new cookbook kills the hosts that was unexpected, but the cookbook is very quick so I get why it happens. M...
[09:54:09] (PS1) Subscriptshoe9: Upload HD Logo for fawikivoyage, jawikiquote&cywikiquote Work with Google Code-in 2019 [mediawiki-config] - https://gerrit.wikimedia.org/r/560380
[09:54:30] PROBLEM - Maps - OSM synchronization lag - codfw on icinga1001 is CRITICAL: 4.874e+06 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1
[09:55:07] Operations, DC-Ops, decommission, Discovery-Search (Current work): decommission elastic10[18-31].eqiad.wmnet - https://phabricator.wikimedia.org/T239821 (MoritzMuehlenhoff) >>! In T239821#5760345, @Volans wrote: > Interesting, given that the new cookbook kills the hosts that was unexpected, but t...
[09:57:20] (CR) Muehlenhoff: "If the initial sudo rules was acked, then this seems fine without another loop through the SRE meeting, to me this seems like just a varia" [puppet] - https://gerrit.wikimedia.org/r/558687 (https://phabricator.wikimedia.org/T236180) (owner: EBernhardson)
[10:01:25] (PS1) Volans: sre.hosts.decommission: avoid race condition [cookbooks] - https://gerrit.wikimedia.org/r/560381 (https://phabricator.wikimedia.org/T239821)
[10:01:25] moritzm: patch for that ^^^
[10:01:45] (PS2) Subscriptshoe9: Upload HD Logo for fawikivoyage, jawikiquote&cywikiquote Work with Google Code-in 2019 [mediawiki-config] - https://gerrit.wikimedia.org/r/560380
[10:04:21] looking
[10:05:16] (CR) Muehlenhoff: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/560381 (https://phabricator.wikimedia.org/T239821) (owner: Volans)
[10:05:55] (CR) Volans: [C: +2] sre.hosts.decommission: avoid race condition [cookbooks] - https://gerrit.wikimedia.org/r/560381 (https://phabricator.wikimedia.org/T239821) (owner: Volans)
[10:06:06] thanks for noticing and debugging it
[10:06:23] thanks for fixing :-)
[10:06:32] !log removing elastic1019 from puppetdb T239821
[10:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:43] T239821: decommission elastic10[18-31].eqiad.wmnet - https://phabricator.wikimedia.org/T239821
[10:08:11] (Merged) jenkins-bot: sre.hosts.decommission: avoid race condition [cookbooks] - https://gerrit.wikimedia.org/r/560381 (https://phabricator.wikimedia.org/T239821) (owner: Volans)
[10:13:05] (PS3) Subscriptshoe9: Upload HD Logo for fawikivoyage, jawikiquote&cywikiquote Work with Google Code-in 2019 [mediawiki-config] - https://gerrit.wikimedia.org/r/560380
[10:19:44] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/560378 (https://phabricator.wikimedia.org/T237605) (owner: Elukey)
[10:19:58] (CR) Elukey: [C: +2] admin: Backfill kerberos settings to data.yaml [puppet] - https://gerrit.wikimedia.org/r/560378 (https://phabricator.wikimedia.org/T237605) (owner: Elukey)
[10:38:16] (CR) Arturo Borrero Gonzalez: [C: +2] Openstack keystone (ocata/stretch): pull in python-ldap v3 [puppet] - https://gerrit.wikimedia.org/r/560360 (https://phabricator.wikimedia.org/T229227) (owner: Andrew Bogott)
[10:41:28] (CR) Muehlenhoff: [C: +1] "Looks good to me!" (1 comment) [docker-images/docker-report] - https://gerrit.wikimedia.org/r/559804 (https://phabricator.wikimedia.org/T241206) (owner: Giuseppe Lavagetto)
[10:47:50] !log import python-ldap 3.1.0-2~bpo9+1~wmf1 into stretch-wikimedia/component/python-ldap-bpo (T229227)
[10:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:57] T229227: Usernames containing unicode characters unable to authenticate to Wikitech and Horizon - https://phabricator.wikimedia.org/T229227
[10:58:07] (CR) Muehlenhoff: [C: +1] "Looks great!" [docker-images/docker-report] - https://gerrit.wikimedia.org/r/559933 (owner: Giuseppe Lavagetto)
[11:02:05] (CR) Ammarpad: "You also have to set it ready for review by clicking "Set ready for review" button on the ui" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/560380 (owner: Subscriptshoe9)
[11:04:19] !log removing dubnium in ganeti T224557
[11:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:26] T224557: Migrate ldap/corp replicas to Stretch/Buster - https://phabricator.wikimedia.org/T224557
[11:05:17] !log import python-pyasn1 0.4.2-3~bpo9+1~wmf1 into stretch-wikimedia/component/python-ldap-bpo (T229227)
[11:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:24] T229227: Usernames containing unicode characters unable to authenticate to Wikitech and Horizon - https://phabricator.wikimedia.org/T229227
[11:05:53] !log jmm@cumin1001 START - Cookbook sre.hosts.decommission
[11:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:53] !log jmm@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[11:06:56] Operations: Migrate ldap/corp replicas to Stretch/Buster - https://phabricator.wikimedia.org/T224557 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1001 for hosts: `dubnium.wikimedia.org` - dubnium.wikimedia.org (**FAIL**) - Downtimed host on Icinga - No management interface...
[11:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:33] (PS4) Subscriptshoe9: Upload HD Logo for fawikivoyage, jawikiquote&cywikiquote Work with Google Code-in 2019 [mediawiki-config] - https://gerrit.wikimedia.org/r/560380 (https://phabricator.wikimedia.org/T150618)
[11:13:20] (PS1) Muehlenhoff: Remove remaining Puppet refs of old jessie LDAP corp servers [puppet] - https://gerrit.wikimedia.org/r/560383 (https://phabricator.wikimedia.org/T224557)
[11:35:25] (PS5) Subscriptshoe9: Upload HD Logo for fawikivoyage, jawikiquote&cywikiquote Work with Google Code-in 2019 [mediawiki-config] - https://gerrit.wikimedia.org/r/560380 (https://phabricator.wikimedia.org/T150618)
[11:41:12] (PS1) MSantos: Change OSM replication to hourly [puppet] - https://gerrit.wikimedia.org/r/560385
[11:44:09] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/560339 (https://phabricator.wikimedia.org/T241329) (owner: MarcoAurelio)
[11:44:26] (Abandoned) Andrew Bogott: Keystone hooks: monkeypatch ldap.initialize to set bytes_mode=false [puppet] - https://gerrit.wikimedia.org/r/560366 (https://phabricator.wikimedia.org/T229227) (owner: Andrew Bogott)
[11:56:50] (PS6) Subscriptshoe9: Upload HD Logo for fawikivoyage, jawikiquote&cywikiquote Work with Google Code-in 2019 [mediawiki-config] - https://gerrit.wikimedia.org/r/560380 (https://phabricator.wikimedia.org/T150618)
[12:14:29] (CR) Muehlenhoff: [C: +2] Remove remaining Puppet refs of old jessie LDAP corp servers [puppet] - https://gerrit.wikimedia.org/r/560383 (https://phabricator.wikimedia.org/T224557) (owner: Muehlenhoff)
[12:22:24] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/560380 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9)
[12:31:55] (PS1) Subscriptshoe9: Upload HD Logo for fawikivoyage, jawikiquote&cywikiquote Fix IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618)
[12:33:16] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9)
[12:34:20] (CR) jerkins-bot: [V: -1] Upload HD Logo for fawikivoyage, jawikiquote&cywikiquote Fix IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9)
[12:43:54] (PS1) Subscriptshoe9: Upload HD Logo for fawikivoyage, jawikiquote&cywikiquote Fix File Name. [mediawiki-config] - https://gerrit.wikimedia.org/r/560388 (https://phabricator.wikimedia.org/T560386)
[12:58:12] (PS7) Subscriptshoe9: Upload HD Logo for fawikivoyage, jawikiquote&cywikiquote Work with Google Code-in 2019 Fix File Name [mediawiki-config] - https://gerrit.wikimedia.org/r/560380 (https://phabricator.wikimedia.org/T150618)
[13:00:03] (Abandoned) Subscriptshoe9: Upload HD Logo for fawikivoyage, jawikiquote&cywikiquote Fix File Name. [mediawiki-config] - https://gerrit.wikimedia.org/r/560388 (https://phabricator.wikimedia.org/T560386) (owner: Subscriptshoe9)
[13:01:45] !log uploaded libvpx 1.7.0-3+wmf2 to component/vp9
[13:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:40] (PS2) Subscriptshoe9: Upload HD Logo for fawikivoyage, jawikiquote&cywikiquote Fix IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618)
[13:05:23] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[13:06:11] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9)
[13:06:42] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[13:06:44] Operations, Patch-For-Review: Migrate ldap/corp replicas to Stretch/Buster - https://phabricator.wikimedia.org/T224557 (MoritzMuehlenhoff) Open→Resolved This is complete. The new systems based on buster are ldap-corp1001 and ldap-corp2001 and dubnium/pollux have been decomissioned.
[13:07:11] (CR) jerkins-bot: [V: -1] Upload HD Logo for fawikivoyage, jawikiquote&cywikiquote Fix IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9)
[13:09:36] (PS3) Muehlenhoff: eventlog: Switch to standard recipe [puppet] - https://gerrit.wikimedia.org/r/559827 (https://phabricator.wikimedia.org/T156955)
[13:11:33] (PS3) Subscriptshoe9: Upload HD Logo for fawikivoyage, jawikiquote&cywikiquote Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618)
[13:12:06] (CR) Muehlenhoff: [C: +2] eventlog: Switch to standard recipe [puppet] - https://gerrit.wikimedia.org/r/559827 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[13:12:20] (PS4) Subscriptshoe9: Upload HD Logo for fawikivoyage, jawikiquote&cywikiquote Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618)
[13:15:18] (PS5) Subscriptshoe9: Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618)
[13:15:49] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9)
[13:17:01] (CR) jerkins-bot: [V: -1] Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9)
[13:17:09] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/560380 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9)
[13:22:35] (PS6) Subscriptshoe9: Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618)
[13:23:01] !log installing libvorbis security updates
[13:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:20] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9)
[13:25:23] (CR) jerkins-bot: [V: -1] Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9)
[13:29:55] (PS7) Subscriptshoe9: Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618)
[13:30:23] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9)
[13:31:34] (CR) jerkins-bot: [V: -1] Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9)
[13:31:43] !log installing NSS security updates on jessie
[13:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:39] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:35:23] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:36:05] (PS8) Subscriptshoe9: Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618)
[13:37:26] (CR) Urbanecm: [C: -1] Add HD logos to IS.php (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9)
[13:40:02] (PS9) Subscriptshoe9: Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618)
[13:40:07] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9)
[13:42:20] !log installing cpio security updates on jessie
[13:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:17] (PS2) CDanis: puppet-merge.py: SHA1 or explicit FETCH_HEAD is mandatory [puppet] - https://gerrit.wikimedia.org/r/559944
[13:49:38] (PS8) Subscriptshoe9: Upload HD Logo for fawikivoyage, jawikiquote and cywikiquote. Fix File Name [mediawiki-config] - https://gerrit.wikimedia.org/r/560380 (https://phabricator.wikimedia.org/T150618)
[13:51:31] (PS9) Subscriptshoe9: Upload HD Logo for fawikivoyage, jawikiquote and cywikiquote. [mediawiki-config] - https://gerrit.wikimedia.org/r/560380 (https://phabricator.wikimedia.org/T150618)
[13:52:08] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/560380 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9)
[13:54:19] (PS10) Subscriptshoe9: Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618)
[13:56:44] (PS11) Subscriptshoe9: Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618)
[13:59:53] (CR) Urbanecm: [C: +1] "Looks good, thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/560380 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9)
[14:00:17] (CR) Urbanecm: [C: +1] "Looks good, thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9)
[14:40:38] Operations, ops-eqiad, DBA: Degraded RAID on db1123 - https://phabricator.wikimedia.org/T240534 (Jclark-ctr) drive changed
[14:46:52] (PS1) Muehlenhoff: Ship the migrations file from django_cas to create CAS-related tables [software/debmonitor] - https://gerrit.wikimedia.org/r/560395
[14:49:35] (CR) jerkins-bot: [V: -1] Ship the migrations file from django_cas to create CAS-related tables [software/debmonitor] - https://gerrit.wikimedia.org/r/560395 (owner: Muehlenhoff)
[14:50:07] Operations, ops-eqiad, DBA: Degraded RAID on db1123 - https://phabricator.wikimedia.org/T240534 (jcrespo) ` root@db1123:~$ megacli -PDRbld -ShowProg -PhysDrv [32:9] -aALL Rebuild Progress on Device at Enclosure 32, Slot 9 Completed 12% in 9 Minutes. Exit Code: 0...
[14:50:29] moritzm: why do we need this migrations manually?
[14:50:38] the dependency should install anything it needs [14:56:20] it doesn't seem to get picked up by "migrate" by default unless explicitly added, maybe because it's run time-enabled depending on config settings [14:57:22] where? [14:57:27] locally, on the test env, or in prod [14:58:21] test env, but migrate ran there in the mean time with the shipped migration file, but I can revert the application of the migration [14:59:19] we shouldn't include other apps migration in our code, something else is wrong/missing [15:00:53] !log jynus@cumin1001 dbctl commit (dc=all): 'Reducing db1126 main s8 weight, seems flapping', diff saved to https://phabricator.wikimedia.org/P10012 and previous config saved to /var/cache/conftool/dbconfig/20191223-150052-jynus.json [15:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:37] !log fdans@deploy1001 Started deploy [analytics/refinery@531752b]: deploying refinery [15:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:08] (03PS1) 10Elukey: airflow: move keytab to a more consistent naming [puppet] - 10https://gerrit.wikimedia.org/r/560398 [15:06:54] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) I believe I already defined the racking proposal at T235659 [15:07:56] (03PS1) 10Elukey: Rename an-airflow1001's keytab [labs/private] - 10https://gerrit.wikimedia.org/r/560399 [15:08:08] (03CR) 10Elukey: [V: 03+2 C: 03+2] Rename an-airflow1001's keytab [labs/private] - 10https://gerrit.wikimedia.org/r/560399 (owner: 10Elukey) [15:08:36] volans: ack, I'll revert in the test instance and have a deeper look [15:08:43] 10Operations, 10ops-codfw, 10DBA: codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (10Marostegui) >>! 
In T241336#5760280, @jcrespo wrote: > @Papaul sorry he will be away, I don't have a clear solution, but I believe any other place on the same row, but not the sam... [15:09:18] (03CR) 10Elukey: [C: 03+2] airflow: move keytab to a more consistent naming [puppet] - 10https://gerrit.wikimedia.org/r/560398 (owner: 10Elukey) [15:10:02] moritzm: migrations seems to be there: /srv/deployment/debmonitor/venv/lib/python3.5/site-packages/django_cas_ng/migrations [15:10:04] (03PS4) 10Elukey: airflow: Enable kerberos configuration [puppet] - 10https://gerrit.wikimedia.org/r/558687 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [15:10:46] !log fdans@deploy1001 Finished deploy [analytics/refinery@531752b]: deploying refinery (duration: 08m 09s) [15:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:01] yeah, it's weird, "showmigrations" even shows it as applied, debugging [15:11:10] ack [15:11:15] (03CR) 10Elukey: [C: 03+2] airflow: Enable kerberos configuration [puppet] - 10https://gerrit.wikimedia.org/r/558687 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [15:13:45] moritzm: wild guess without looking, maybe we're appending cas to the installed apps too late? [15:22:45] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:23:21] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1123 - https://phabricator.wikimedia.org/T240534 (10jcrespo) @Marostegui I've noticed this server has a strip size of 64K. 
I think as a rule we should audit the RAID configuration on first setup, as it cannot be changed without destroying all data on disks... [15:23:48] (03PS1) 10Elukey: airflow: fix kerberos configuration [puppet] - 10https://gerrit.wikimedia.org/r/560406 [15:25:37] (03CR) 10Elukey: [C: 03+2] airflow: fix kerberos configuration [puppet] - 10https://gerrit.wikimedia.org/r/560406 (owner: 10Elukey) [15:25:37] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10jcrespo) [15:29:44] (03PS9) 10Jhedden: ceph: add support for dedicated cluster network [puppet] - 10https://gerrit.wikimedia.org/r/559620 (https://phabricator.wikimedia.org/T240965) [15:41:07] (03PS10) 10Jhedden: ceph: add support for dedicated cluster network [puppet] - 10https://gerrit.wikimedia.org/r/559620 (https://phabricator.wikimedia.org/T240965) [15:42:19] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:44:50] volans: after some digging turns out I simply had some unclean/broken/interim state in the test environment; after I manually deleted the cas_ng entry from the django_migrations table it was correctly marked as not yet applied and running run-django-command migrate correctly created the session/TGT tables [15:45:04] I'll abandon the patch, thanks for nudging me on the right path :-) [15:45:09] (03PS11) 10Jhedden: ceph: add support for dedicated cluster network [puppet] - 10https://gerrit.wikimedia.org/r/559620 (https://phabricator.wikimedia.org/T240965) [15:45:45] moritzm: great, glad to hear it was a red herring and all works! 
[15:45:56] (03Abandoned) 10Muehlenhoff: Ship the migrations file from django_cas to create CAS-related tables [software/debmonitor] - 10https://gerrit.wikimedia.org/r/560395 (owner: 10Muehlenhoff) [15:45:57] RECOVERY - MegaRAID on db1123 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:47:17] (03CR) 10Jhedden: [C: 03+2] ceph: add support for dedicated cluster network [puppet] - 10https://gerrit.wikimedia.org/r/559620 (https://phabricator.wikimedia.org/T240965) (owner: 10Jhedden) [15:47:27] (03CR) 10Jhedden: [C: 03+2] "PCC results https://puppet-compiler.wmflabs.org/compiler1002/20121/" [puppet] - 10https://gerrit.wikimedia.org/r/559620 (https://phabricator.wikimedia.org/T240965) (owner: 10Jhedden) [15:57:45] !log shut down ms-fe2007 for NIC replacement [15:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:56] (03CR) 10Giuseppe Lavagetto: "A (somewhat) superficial look at the code seems good to me, but I have one unresolved doubt about the approach you took. 
I'll get deeper i" (031 comment) [software/httpbb] - 10https://gerrit.wikimedia.org/r/559952 (owner: 10RLazarus) [16:14:35] (03PS1) 10Papaul: DHCP: Change MAC address for ms-fe2007 [puppet] - 10https://gerrit.wikimedia.org/r/560409 (https://phabricator.wikimedia.org/T239805) [16:15:33] 10Operations, 10ops-codfw, 10Patch-For-Review: ms-fe2007 NIC failure - https://phabricator.wikimedia.org/T239805 (10Papaul) @fgiunchedi NIC replaced new MAC address F4:E9:D4:95:61:40 [16:17:04] (03PS1) 10Jhedden: ceph: allow lvs traffic to manager exporter [puppet] - 10https://gerrit.wikimedia.org/r/560410 (https://phabricator.wikimedia.org/T240715) [16:22:41] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 64 probes of 509 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [16:23:09] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:24:14] (03CR) 10Jhedden: [C: 03+2] "PCC results: https://puppet-compiler.wmflabs.org/compiler1001/20122/" [puppet] - 10https://gerrit.wikimedia.org/r/560410 (https://phabricator.wikimedia.org/T240715) (owner: 10Jhedden) [16:24:19] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:27:55] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 28 probes of 509 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [16:28:27] (03CR) 10Jhedden: "@vgutierrez I've opened port 9283 on the backend hosts to public1-b-eqiad." [puppet] - 10https://gerrit.wikimedia.org/r/559110 (https://phabricator.wikimedia.org/T240715) (owner: 10Jhedden) [16:31:10] (03CR) 10Vgutierrez: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/559110 (https://phabricator.wikimedia.org/T240715) (owner: 10Jhedden) [16:33:09] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:34:33] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:41:25] (03CR) 10Vgutierrez: "I don't know if this part of another CR, but I cannot find the discovery part and lvs::realserver::realserver_ips puppetization, see https" [puppet] - 10https://gerrit.wikimedia.org/r/559110 (https://phabricator.wikimedia.org/T240715) (owner: 10Jhedden) [16:45:07] (03PS3) 10Giuseppe Lavagetto: Add class to scan a registry for images [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/559804 (https://phabricator.wikimedia.org/T241206) [16:45:09] (03PS4) 10Giuseppe Lavagetto: Add a registry reporter [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/559933 [17:12:22] 10Operations, 10Security-Team, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236 (10chasemp) [17:13:01] 10Operations, 10Security-Team, 10observability, 10Patch-For-Review: icinga notification if elevated writing to badpass.log - https://phabricator.wikimedia.org/T150300 (10chasemp) [17:13:05] 10Operations, 10Release-Engineering-Team-TODO, 10Security-Team, 10Release-Engineering-Team (Deployment services), 10User-greg: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270 (10chasemp) [17:13:41] 10Operations, 10Security-Team, 10Wikimedia-General-or-Unknown, 10WorkType-NewFunctionality: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860 (10chasemp) [17:13:44] (03PS1) 10Elukey: admin: add kerberos flag to Chase's user metadata [puppet] - 10https://gerrit.wikimedia.org/r/560415 (https://phabricator.wikimedia.org/T241370) [17:14:32] (03CR) 10Elukey: [C: 03+2] admin: add kerberos flag to Chase's user metadata [puppet] - 10https://gerrit.wikimedia.org/r/560415 
(https://phabricator.wikimedia.org/T241370) (owner: 10Elukey) [17:20:07] (03PS1) 10Volans: images: fix authentication [software/debmonitor] - 10https://gerrit.wikimedia.org/r/560416 [17:26:04] 10Operations, 10ops-codfw, 10DBA: codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (10Papaul) @jcrespo thanks [17:27:57] 10Operations, 10ops-codfw, 10DBA: codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (10Papaul) @jcrespo we will have to move es2024: Row A rack A4 too since A4 is also a 10G rack. Thanks. [17:32:52] 10Operations, 10Puppet, 10Patch-For-Review: puppet-merge can't accept an explicit SHA1 for an --ops merge - https://phabricator.wikimedia.org/T241277 (10Crutishnyk) Sorry, I did a mistake when writing a commit message. It was for T241227 [17:36:28] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "The code looks fine, but I think it introduces some compromises that we might have to pay down the line. Specifically, with the current ap" (033 comments) [software/httpbb] - 10https://gerrit.wikimedia.org/r/559952 (owner: 10RLazarus) [17:47:53] 10Operations, 10netops: fastnetmon misreports attack type and protocol - https://phabricator.wikimedia.org/T241374 (10CDanis) [17:52:55] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10wiki_willy) [17:53:20] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10wiki_willy) a:03Jclark-ctr [17:53:26] (03CR) 10Jdlrobson: [C: 03+1] "This is ready to SWAT. Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) (owner: 10Ammarpad) [17:58:44] 10Operations, 10ops-esams: Terminate OE10,11,12,13 Racks - https://phabricator.wikimedia.org/T237055 (10wiki_willy) Followed up with Jim last week on this, and sorted out a few questions he had with the existing contract and what we would like to include in the termination letter. Will update again, when more... [18:40:05] 10Operations, 10SDC General, 10Structured Data Engineering, 10Structured-Data-Backlog, and 2 others: Create CQS puppet configs by applying query_service module - https://phabricator.wikimedia.org/T237089 (10Igorkim78) The configuration changes for SDC data are as follows (note that namespace 'sdc' is used... [18:54:51] !log Deleting mgmt IP addresses from Netbox that are connected to offline devices. T228387 [18:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:59] T228387: Bare metal cloud: management interfaces - https://phabricator.wikimedia.org/T228387 [19:28:13] (03PS1) 10Volans: dns: create only one mgmt asset tag record [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/560426 [19:28:48] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/560426 (owner: 10Volans) [19:31:21] (03CR) 10Volans: [C: 03+2] dns: create only one mgmt asset tag record [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/560426 (owner: 10Volans) [20:11:08] 10Operations, 10ops-codfw, 10DBA: codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (10Marostegui) >>! In T241336#5761081, @Papaul wrote: > @jcrespo we will have to move es2024: Row A rack A4 too since A4 is also a 10G rack. > > Thanks. Anywhere within row A is... 
[20:12:05] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) Partitioning recipe was also already defined in puppet some days ago so we are good on that front too. Thanks guys [20:36:14] (03PS1) 10Jhedden: ceph: ignore ceph osds in nagios check_disk [puppet] - 10https://gerrit.wikimedia.org/r/560431 (https://phabricator.wikimedia.org/T240722) [20:40:31] (03PS2) 10Jhedden: ceph: ignore ceph osds in nagios check_disk [puppet] - 10https://gerrit.wikimedia.org/r/560431 (https://phabricator.wikimedia.org/T240722) [20:43:58] (03PS3) 10Jhedden: ceph: ignore ceph osds in nagios check_disk [puppet] - 10https://gerrit.wikimedia.org/r/560431 (https://phabricator.wikimedia.org/T240722) [20:46:01] (03CR) 10Jhedden: [C: 03+2] "PCC results: https://puppet-compiler.wmflabs.org/compiler1002/20125/" [puppet] - 10https://gerrit.wikimedia.org/r/560431 (https://phabricator.wikimedia.org/T240722) (owner: 10Jhedden) [21:10:19] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:10:31] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:16:31] (03PS2) 10Gehel: Change OSM replication to hourly [puppet] - 10https://gerrit.wikimedia.org/r/560385 (owner: 10MSantos) [21:18:37] (03CR) 10Gehel: [C: 03+2] Change OSM replication to hourly [puppet] - 10https://gerrit.wikimedia.org/r/560385 (owner: 10MSantos) [21:33:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:35:15] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:35:42] (03PS1) 10Gehel: maps: fix type constraints for OSM crons [puppet] - 10https://gerrit.wikimedia.org/r/560441 [21:37:49] (03CR) 10Gehel: [C: 03+2] maps: fix type constraints for OSM crons [puppet] - 10https://gerrit.wikimedia.org/r/560441 (owner: 10Gehel) [21:50:49] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:57:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [22:03:55] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:04:01] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:17:29] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [23:37:55] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:39:41] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops