[00:01:08] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:25:07] (CR) DannyS712: Grant oathauth-disable-for-user and oathauth-verify-user to wmf-supportsafety at Meta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/650988 (https://phabricator.wikimedia.org/T180896) (owner: Urbanecm)
[00:26:13] (PS3) Urbanecm: Grant several OATHAuth-related permissions to wmf-supportsafety at Meta [mediawiki-config] - https://gerrit.wikimedia.org/r/650988 (https://phabricator.wikimedia.org/T180896)
[00:26:58] (CR) Urbanecm: Grant several OATHAuth-related permissions to wmf-supportsafety at Meta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/650988 (https://phabricator.wikimedia.org/T180896) (owner: Urbanecm)
[00:42:50] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2966605208 and 154 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:00] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 532007656 and 107 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:00] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 9564481008 and 619 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:26] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 620877320 and 135 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:46:18] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2914622416 and 288 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:46:32] 
RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 24512 and 175 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:46:58] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 245176 and 201 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:47:02] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3390982424 and 359 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:47:02] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2560197240 and 311 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:43:50] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 1272 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:45:20] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:34:28] (PS1) Andrew Bogott: Openstack galera cluster: move data to /srv [puppet] - https://gerrit.wikimedia.org/r/651012 (https://phabricator.wikimedia.org/T270552)
[03:34:55] (CR) Andrew Bogott: [C: -1] "do not merge -- merging requires steps described in the attached phab task" [puppet] - https://gerrit.wikimedia.org/r/651012 (https://phabricator.wikimedia.org/T270552) (owner: Andrew Bogott)
[04:11:51] (CR) Gergő Tisza: [C: +1] labs: bnwiki: Fix a typo in wgGEHelpPanelLinks config [mediawiki-config] - https://gerrit.wikimedia.org/r/650948 (https://phabricator.wikimedia.org/T270578) (owner: Urbanecm)
[05:06:20] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:35:44] !log Compress clouddb1017:3313 clouddb1013:3313 T270473
[06:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:49] T270473: Ensure InnoDB is compressed on the new clouddb hosts - https://phabricator.wikimedia.org/T270473
[06:40:14] (PS1) Marostegui: mariadb: Decommission es1013 [puppet] - https://gerrit.wikimedia.org/r/651018 (https://phabricator.wikimedia.org/T268436)
[06:48:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[06:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:15] Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission es1013.eqiad.wmnet - https://phabricator.wikimedia.org/T268436 (Marostegui) a:LSobanski→wiki_willy
[06:53:35] Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission es1013.eqiad.wmnet - https://phabricator.wikimedia.org/T268436 (Marostegui) Host read 
for #dc-ops
[06:55:51] (CR) Elukey: [C: +2] druid: Migrate hiera() to lookup() and setting datatype in middlemanager [puppet] - https://gerrit.wikimedia.org/r/650993 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[06:56:22] (CR) Marostegui: [C: +2] mariadb: Decommission es1013 [puppet] - https://gerrit.wikimedia.org/r/651018 (https://phabricator.wikimedia.org/T268436) (owner: Marostegui)
[06:56:54] marostegui: hola, shall I merge? :)
[06:57:06] elukey: yes please! I am also seeing a change from Amir1
[06:57:20] ah then proceed!
[06:57:32] elukey: I am not sure if we can merge though
[06:57:34] I am checking what it is
[06:57:51] marostegui: nono it is mine, I merged it
[06:57:55] aaaah sorry
[06:57:58] it is part of the move to lookup/types
[06:58:11] Ah, as the user was Amir I got confused :)
[07:07:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112 T268742 ', diff saved to https://phabricator.wikimedia.org/P13609 and previous config saved to /var/cache/conftool/dbconfig/20201221-070748-marostegui.json
[07:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:52] T268742: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742
[07:21:13] (PS1) Effie Mouzeli: hiera: upgrade mc1023, mc2023 to buster [puppet] - https://gerrit.wikimedia.org/r/651019 (https://phabricator.wikimedia.org/T213089)
[07:32:30] (CR) Effie Mouzeli: [C: +2] hiera: upgrade mc1023, mc2023 to buster [puppet] - https://gerrit.wikimedia.org/r/651019 (https://phabricator.wikimedia.org/T213089) (owner: Effie Mouzeli)
[07:34:01] Operations, Platform Engineering, Wikidata, serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc1023.eqiad.wmnet ` The log can be... 
[07:34:13] Operations, Platform Engineering, Wikidata, serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc2023.codfw.wmnet ` The log can be...
[07:47:51] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1023.eqiad.wmnet with reason: REIMAGE
[07:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:07] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2023.codfw.wmnet with reason: REIMAGE
[07:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:34] Operations, Platform Engineering, Wikidata, serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (jijiki)
[07:49:55] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1023.eqiad.wmnet with reason: REIMAGE
[07:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:01] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2023.codfw.wmnet with reason: REIMAGE
[07:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:03] (CR) Muehlenhoff: add platform engineering folks to snapshot and dumpsdata server access (2 comments) [puppet] - https://gerrit.wikimedia.org/r/649077 (owner: ArielGlenn)
[07:57:07] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/650010 (https://phabricator.wikimedia.org/T241195) (owner: Legoktm)
[08:02:58] Operations, ops-eqiad, DBA: Degraded RAID on db1101 - https://phabricator.wikimedia.org/T270571 (wiki_willy) a:Jclark-ctr @Jclark-ctr or @Cmjohnson - do we have any decom'd servers onsite with this drive size? 
Thanks, Willy
[08:06:08] Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission es1013.eqiad.wmnet - https://phabricator.wikimedia.org/T268436 (wiki_willy) a:wiki_willy→Cmjohnson
[08:09:10] Operations, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (elukey) >>! In T260445#6694591, @elukey wrote: > @Cmjohnson what do you think about the last proposal? @Cmjohnson asking again just to get he conf...
[08:12:08] Operations, Platform Engineering, Wikidata, serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1023.eqiad.wmnet'] ` and were **ALL** successful.
[08:15:05] !log Add ips to the x2 instances on dbctl T269324
[08:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:09] T269324: Productionize x2 databases - https://phabricator.wikimedia.org/T269324
[08:22:49] !log dcausse@deploy1001 Started deploy [wdqs/wdqs@512d713]: GUI updates (T269224+i18n updates)
[08:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:52] T269224: Track outcome of Query Builder queries via embedded query service js - https://phabricator.wikimedia.org/T269224
[08:31:46] !log dcausse@deploy1001 Finished deploy [wdqs/wdqs@512d713]: GUI updates (T269224+i18n updates) (duration: 08m 57s)
[08:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:50] T269224: Track outcome of Query Builder queries via embedded query service js - https://phabricator.wikimedia.org/T269224
[08:32:00] cc Amir1 ^
[08:34:28] Operations, Platform Engineering, Wikidata, serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (ops-monitoring-bot) Completed auto-reimage of hosts: ` 
['mc2023.codfw.wmnet'] ` and were **ALL** successful.
[09:16:03] (PS1) Muehlenhoff: Remove obsolete (and expired) repository key for Tor [puppet] - https://gerrit.wikimedia.org/r/651156 (https://phabricator.wikimedia.org/T269861)
[09:19:14] <_joe_> !log powercycling wdqs1011, unresponsive to ssh
[09:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:06] PROBLEM - Blazegraph process -wdqs-categories- on wdqs1011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:22:16] RECOVERY - WDQS HTTP on wdqs1011 is OK: HTTP OK: HTTP/1.1 200 OK - 17017 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:22:24] RECOVERY - WDQS SPARQL on wdqs1011 is OK: HTTP OK: HTTP/1.1 200 OK - 17017 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:22:40] PROBLEM - Blazegraph Port for wdqs-categories on wdqs1011 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:22:46] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1011 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:22:58] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:23:06] RECOVERY - SSH on wdqs1011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:23:06] PROBLEM - Query Service HTTP Port on wdqs1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 
0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[09:23:32] PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:23:42] RECOVERY - Blazegraph process -wdqs-categories- on wdqs1011 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:23:44] PROBLEM - WDQS high update lag on wdqs1011 is CRITICAL: 1.787e+05 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:24:18] RECOVERY - Blazegraph Port for wdqs-categories on wdqs1011 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:24:24] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1011 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:24:36] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1011 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:24:46] RECOVERY - Query Service HTTP Port on wdqs1011 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[09:24:50] !log depooling wdqs1011 (lag)
[09:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:42] <_joe_> dcausse: uhm shouldn't we have a pybal check that auto-depools servers with high lag? 
[09:29:14] <_joe_> like, wdqs should expose a /healthz endpoint that returns non-200 if lag exceeds N minutes
[09:29:34] _joe_: indeed
[09:29:35] <_joe_> in case all servers are lagged, pybal would only depool a few
[09:30:04] makes sense, will file a task
[09:31:11] concerning why it went down I see nothing except a bunch of NUL chars in syslog, I suppose it "simply" crashed
[09:31:18] (PS1) Muehlenhoff: Install php-readline from component/php72 [puppet] - https://gerrit.wikimedia.org/r/651158 (https://phabricator.wikimedia.org/T245757)
[09:31:31] (PS2) Muehlenhoff: Install php-readline from component/php72 [puppet] - https://gerrit.wikimedia.org/r/651158 (https://phabricator.wikimedia.org/T245757)
[09:36:23] <_joe_> yeah
[09:41:26] RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:43:23] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/27222/" [puppet] - https://gerrit.wikimedia.org/r/651158 (https://phabricator.wikimedia.org/T245757) (owner: Muehlenhoff)
[09:44:40] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:46:21] <_joe_> !log systemctl reset-failed on deneb, timeout downloading a docker image from the registry
[09:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:41] <_joe_> !log logging out of the long-running root screen session on maps1001
[09:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:35] <_joe_> !log logging out of the long-running root screen session on maps1010
[09:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:56] Operations, serviceops, User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (Volans) 
p:Triage→Medium
[10:00:01] Operations, Dumps-Generation, Platform Engineering, serviceops: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (ArielGlenn) Preliminaries: - build mwbzutils package for buster and make sure it passes all tests
[10:01:39] Operations, Growth-Team, Mail, Notifications, and 2 others: SRE query: Is it possible to measure how many e-mails are sent to "black hole" e-mail addresses? - https://phabricator.wikimedia.org/T202329 (Volans) p:Triage→Low
[10:05:38] Operations, Beta-Cluster-Infrastructure, Release-Engineering-Team, observability, User-DannyS712: Beta cluster logstash down - https://phabricator.wikimedia.org/T268200 (Volans)
[10:06:52] (PS1) ArielGlenn: version 0.1.2 [debs/mwbzutils] - https://gerrit.wikimedia.org/r/651162
[10:10:15] Operations, Beta-Cluster-Infrastructure, Release-Engineering-Team, observability, User-DannyS712: Beta cluster logstash down - https://phabricator.wikimedia.org/T268200 (DannyS712)
[10:11:51] (CR) Arturo Borrero Gonzalez: "per comments on T267966 I think etcd is more sensitive to the underlying ceph status. I'm not sure if this would make any actual differenc" [puppet] - https://gerrit.wikimedia.org/r/650470 (https://phabricator.wikimedia.org/T267966) (owner: Arturo Borrero Gonzalez)
[10:26:40] RECOVERY - Long running screen/tmux on maps1001 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[10:32:22] RECOVERY - Long running screen/tmux on maps1010 is OK: OK: No SCREEN or tmux processes detected. 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[10:45:39] (PS15) Jcrespo: Add proof of concept for retrieving metadata and backing up media [software/wmfbackups] - https://gerrit.wikimedia.org/r/643980 (https://phabricator.wikimedia.org/T264189)
[10:46:02] (CR) Giuseppe Lavagetto: [C: -1] "LGTM, minus the missing comma in etcd::v3's params." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) (owner: Bstorm)
[10:56:30] Operations, wmf-sre-laptop: distribute tunnelencabulator in wmf-sre-laptop - https://phabricator.wikimedia.org/T266784 (Majavah)
[10:58:45] (PS1) Jcrespo: Add proof of concept for retrieving metadata and backing up media [software/wmfbackups] (media-backups) - https://gerrit.wikimedia.org/r/651164 (https://phabricator.wikimedia.org/T264189)
[10:59:21] (Abandoned) Jcrespo: Add proof of concept for retrieving metadata and backing up media [software/wmfbackups] - https://gerrit.wikimedia.org/r/643980 (https://phabricator.wikimedia.org/T264189) (owner: Jcrespo)
[11:01:53] (PS3) Jbond: varnish: ratelimit vscode-phabricator plugin [puppet] - https://gerrit.wikimedia.org/r/650494 (https://phabricator.wikimedia.org/T270482)
[11:01:55] (CR) Jcrespo: [C: +2] Add proof of concept for retrieving metadata and backing up media [software/wmfbackups] (media-backups) - https://gerrit.wikimedia.org/r/651164 (https://phabricator.wikimedia.org/T264189) (owner: Jcrespo)
[11:01:57] (CR) Jbond: "updated thanks" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/650494 (https://phabricator.wikimedia.org/T270482) (owner: Jbond)
[11:02:35] (Merged) jenkins-bot: Add proof of concept for retrieving metadata and backing up media [software/wmfbackups] (media-backups) - https://gerrit.wikimedia.org/r/651164 (https://phabricator.wikimedia.org/T264189) (owner: Jcrespo)
[11:10:05] dcausse: thanks!
[11:12:28] yw! 
:)
[11:17:11] (CR) David Caro: [C: +2] [wmcs][backups] Add project and vm info [puppet] - https://gerrit.wikimedia.org/r/650141 (https://phabricator.wikimedia.org/T267195) (owner: David Caro)
[11:17:24] (CR) David Caro: [C: +2] [wmcs][backups] Add cli see where a project/vm is backed up [puppet] - https://gerrit.wikimedia.org/r/650496 (https://phabricator.wikimedia.org/T267195) (owner: David Caro)
[11:17:43] (CR) David Caro: [C: +2] [wmcs][backup] Added command to show a project [puppet] - https://gerrit.wikimedia.org/r/650497 (https://phabricator.wikimedia.org/T267195) (owner: David Caro)
[11:18:56] (PS2) David Caro: [wmcs][backup] Add command to remove/print dangling snapshots [puppet] - https://gerrit.wikimedia.org/r/650535 (https://phabricator.wikimedia.org/T270478)
[11:19:07] (PS2) David Caro: [wmcs][backup] Remove all temp files after usage [puppet] - https://gerrit.wikimedia.org/r/650542 (https://phabricator.wikimedia.org/T270478)
[11:28:34] (CR) Arturo Borrero Gonzalez: Allow specific flows from 172.16/12 to prod (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/643269 (https://phabricator.wikimedia.org/T209082) (owner: Ayounsi)
[11:31:14] !log installing php-pear security updates on buster
[11:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:09] (PS2) Elukey: Port IRCSocketHandler from Spickerack and create irc_utils.py [software/pywmflib] - https://gerrit.wikimedia.org/r/650546 (https://phabricator.wikimedia.org/T257905)
[11:39:53] (CR) David Caro: [C: +1] OpenStack haproxy: make logs much, much quieter [puppet] - https://gerrit.wikimedia.org/r/650943 (https://phabricator.wikimedia.org/T270554) (owner: Andrew Bogott)
[11:41:43] (CR) David Caro: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/650470 (https://phabricator.wikimedia.org/T267966) (owner: Arturo Borrero Gonzalez)
[11:42:18] (PS1) David Caro: wmcs.backups: 
Add a images summary command [puppet] - https://gerrit.wikimedia.org/r/651166
[11:42:47] (CR) jerkins-bot: [V: -1] wmcs.backups: Add a images summary command [puppet] - https://gerrit.wikimedia.org/r/651166 (owner: David Caro)
[11:46:42] (PS2) David Caro: wmcs.backups: Add a images summary command [puppet] - https://gerrit.wikimedia.org/r/651166
[11:48:13] !log installing libxstream-java security updates on buster
[11:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:59] (CR) Arturo Borrero Gonzalez: Allow specific flows from 172.16/12 to prod (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/643269 (https://phabricator.wikimedia.org/T209082) (owner: Ayounsi)
[12:14:23] (PS3) Elukey: Port IRCSocketHandler from Spickerack and create irc_utils.py [software/pywmflib] - https://gerrit.wikimedia.org/r/650546 (https://phabricator.wikimedia.org/T257905)
[12:16:25] (PS4) Elukey: Port IRCSocketHandler from Spickerack and create irc.py [software/pywmflib] - https://gerrit.wikimedia.org/r/650546 (https://phabricator.wikimedia.org/T257905)
[12:19:47] (PS5) Elukey: Port IRCSocketHandler from Spickerack and create irc.py [software/pywmflib] - https://gerrit.wikimedia.org/r/650546 (https://phabricator.wikimedia.org/T257905)
[12:22:41] !log Running jhat on gerrit1001 to analyze a heap dump, expect CPU usage
[12:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:29] (PS1) Arturo Borrero Gonzalez: [DONT MERGE] cloud: expand dmz_cidr list for public endpoints [puppet] - https://gerrit.wikimedia.org/r/651169 (https://phabricator.wikimedia.org/T209082)
[12:29:01] (PS1) Jbond: varnish: add phabricator specific ban in varnish [puppet] - https://gerrit.wikimedia.org/r/651171 (https://phabricator.wikimedia.org/T270618)
[12:30:11] Operations, Traffic, netops, Patch-For-Review: Create Generalised blocking stratagy - 
https://phabricator.wikimedia.org/T270618 (jbond)
[12:40:00] (PS1) Volans: tests: fix deprecated pytest argument [software/pywmflib] - https://gerrit.wikimedia.org/r/651172
[12:40:04] (PS1) Volans: interactive: improve confirmation capabilities [software/pywmflib] - https://gerrit.wikimedia.org/r/651173
[12:43:33] (CR) Jbond: "See inline for questions" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/651171 (https://phabricator.wikimedia.org/T270618) (owner: Jbond)
[12:46:49] (CR) Arturo Borrero Gonzalez: [C: -1] "Marking as -1 to prevent accidental merge." [puppet] - https://gerrit.wikimedia.org/r/651169 (https://phabricator.wikimedia.org/T209082) (owner: Arturo Borrero Gonzalez)
[12:54:20] Operations, Epic, cloud-services-team (Kanban): CloudVPS: network architecture - https://phabricator.wikimedia.org/T209460 (aborrero)
[12:54:34] Operations, Epic, cloud-services-team (Kanban): CloudVPS: network architecture - https://phabricator.wikimedia.org/T209460 (aborrero)
[12:54:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gerrit,gerrit-metrics} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:55:26] PROBLEM - SSH access on gerrit1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit
[12:55:49] hashar: is this due to your work ^^^
[12:56:30] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[12:56:34] yeah
[12:56:45] (PS1) Jbond: varnish: migrate abuse_nets acl to abuse_networks hiera block [puppet] - https://gerrit.wikimedia.org/r/651174 (https://phabricator.wikimedia.org/T193762)
[12:56:52] RECOVERY - SSH access on gerrit1001 is OK: SSH OK - GerritCodeReview_3.2.5-2-gade85f3c32 (APACHE-SSHD-2.4.0) (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Gerrit
[12:56:56] ack thx
[12:57:03] my bad sorry :-\
[12:57:18] no worries seems back now :)
[12:57:58] !log Gerrit briefly paused due to erroneous run of `jmap -clstats`
[12:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:00] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 28706 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[12:58:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:58:15] I ran those on my local machine but the instance is wayy wayyy smaller
[13:00:51] (PS2) Jbond: varnish: migrate abuse_nets acl to abuse_networks hiera block [puppet] - https://gerrit.wikimedia.org/r/651174 (https://phabricator.wikimedia.org/T193762)
[13:02:13] (PS1) David Caro: wmcs.backups: move wikidumpparse to cloudvirt1025 [puppet] - https://gerrit.wikimedia.org/r/651175
[13:02:37] (CR) jerkins-bot: [V: -1] wmcs.backups: move wikidumpparse to cloudvirt1025 [puppet] - https://gerrit.wikimedia.org/r/651175 (owner: David Caro)
[13:22:21] (CR) Jbond: P:toolforge: migrate to ensure_packages (1 comment) [puppet] - https://gerrit.wikimedia.org/r/639826 (https://phabricator.wikimedia.org/T266479) (owner: Jbond)
[13:25:23] (CR) Jbond: [C: +1] "LGTM" [software/pywmflib] - https://gerrit.wikimedia.org/r/650546 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[13:30:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: After cloning db1154:3313', diff saved to https://phabricator.wikimedia.org/P13613 and previous config saved to /var/cache/conftool/dbconfig/20201221-133044-root.json
[13:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:31:15] (CR) ArielGlenn: add platform engineering folks to snapshot and dumpsdata server access (1 comment) [puppet] - https://gerrit.wikimedia.org/r/649077 (owner: ArielGlenn)
[13:35:07] (CR) Hashar: [C: +1] varnish: ratelimit vscode-phabricator plugin [puppet] - https://gerrit.wikimedia.org/r/650494 (https://phabricator.wikimedia.org/T270482) (owner: Jbond)
[13:39:51] Operations, Traffic, netops, Patch-For-Review: Create Generalised blocking stratagy - https://phabricator.wikimedia.org/T270618 (Volans) p:Triage→Medium
[13:43:07] (CR) Hashar: [C: +1] "Looks good to me! Thank you for working on migrating the ban list to the edge!" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/649882 (https://phabricator.wikimedia.org/T270285) (owner: Jbond)
[13:45:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 50%: After cloning db1154:3313', diff saved to https://phabricator.wikimedia.org/P13614 and previous config saved to /var/cache/conftool/dbconfig/20201221-134548-root.json
[13:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:59] (CR) Volans: [C: +1] "LGTM, couple of nits inline, can be merged as is too." (2 comments) [software/pywmflib] - https://gerrit.wikimedia.org/r/650546 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[14:00:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: After cloning db1154:3313', diff saved to https://phabricator.wikimedia.org/P13615 and previous config saved to /var/cache/conftool/dbconfig/20201221-140051-root.json
[14:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:28] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:14:21] (PS1) Jbond: puppetmaster: dont't install puppet master as we use puppet-master-passanger [puppet] - https://gerrit.wikimedia.org/r/651182
[14:15:03] !log installing sleuthkit security updates on buster
[14:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:33] (CR) Jbond: [C: +2] puppetmaster: dont't install puppet master as we use puppet-master-passanger [puppet] - https://gerrit.wikimedia.org/r/651182 (owner: Jbond)
[14:15:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: After cloning db1154:3313', diff saved to https://phabricator.wikimedia.org/P13616 and previous config saved to /var/cache/conftool/dbconfig/20201221-141555-root.json
[14:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:15] !log update puppet on puppetmaster1003
[14:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:24] !log update puppet on puppetmaster1001
[14:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:32] Operations: Integrate Buster 10.7 point update - https://phabricator.wikimedia.org/T269558 (MoritzMuehlenhoff)
[14:23:47] (CR) RLazarus: [C: +1] Install php-readline from component/php72 [puppet] - https://gerrit.wikimedia.org/r/651158 (https://phabricator.wikimedia.org/T245757) (owner: Muehlenhoff)
[14:28:46] (CR) Elukey: Port IRCSocketHandler from Spickerack and create irc.py (2 comments) [software/pywmflib] - https://gerrit.wikimedia.org/r/650546 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[14:29:11] (CR) Elukey: [C: +1] tests: fix deprecated pytest argument [software/pywmflib] - https://gerrit.wikimedia.org/r/651172 (owner: Volans)
[14:36:54] (CR) Andrew Bogott: [C: +2] OpenStack haproxy: make logs much, much quieter [puppet] - 
10https://gerrit.wikimedia.org/r/650943 (https://phabricator.wikimedia.org/T270554) (owner: 10Andrew Bogott) [14:39:55] (03CR) 10Muehlenhoff: add platform engineering folks to snapshot and dumpsdata server access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/649077 (owner: 10ArielGlenn) [14:41:53] (03CR) 10Elukey: [C: 03+1] interactive: improve confirmation capabilities [software/pywmflib] - 10https://gerrit.wikimedia.org/r/651173 (owner: 10Volans) [14:42:37] (03CR) 10Volans: [C: 03+1] Port IRCSocketHandler from Spickerack and create irc.py (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/650546 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [14:43:06] !log upload puppet_5.5.22-1 to wikimedia-buster [14:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:14] (03CR) 10Elukey: [C: 03+2] Port IRCSocketHandler from Spickerack and create irc.py [software/pywmflib] - 10https://gerrit.wikimedia.org/r/650546 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [14:43:43] !log disable puppet to upgrade puppet master packages [14:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/641315 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [14:49:07] (03PS2) 10Volans: tests: fix deprecated pytest argument [software/pywmflib] - 10https://gerrit.wikimedia.org/r/651172 [14:51:13] (03CR) 10Volans: [C: 03+2] tests: fix deprecated pytest argument [software/pywmflib] - 10https://gerrit.wikimedia.org/r/651172 (owner: 10Volans) [14:52:41] (03Merged) 10jenkins-bot: tests: fix deprecated pytest argument [software/pywmflib] - 10https://gerrit.wikimedia.org/r/651172 (owner: 10Volans) [15:04:19] (03PS4) 10ArielGlenn: add platform engineering folks to snapshot and dumpsdata server access [puppet] - 10https://gerrit.wikimedia.org/r/649077 [15:04:43] (03CR) 
10jerkins-bot: [V: 04-1] add platform engineering folks to snapshot and dumpsdata server access [puppet] - 10https://gerrit.wikimedia.org/r/649077 (owner: 10ArielGlenn) [15:06:58] (03PS2) 10Volans: interactive: improve confirmation capabilities [software/pywmflib] - 10https://gerrit.wikimedia.org/r/651173 [15:09:08] (03PS5) 10ArielGlenn: add platform engineering folks to snapshot and dumpsdata server access [puppet] - 10https://gerrit.wikimedia.org/r/649077 [15:10:13] (03CR) 10ArielGlenn: "If this looks good to folks I need to run it by CPT managers again since they are explicitly named in the workflow, per policy. Once that'" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/649077 (owner: 10ArielGlenn) [15:15:37] (03CR) 10Volans: [C: 03+2] interactive: improve confirmation capabilities [software/pywmflib] - 10https://gerrit.wikimedia.org/r/651173 (owner: 10Volans) [15:16:47] (03Merged) 10jenkins-bot: interactive: improve confirmation capabilities [software/pywmflib] - 10https://gerrit.wikimedia.org/r/651173 (owner: 10Volans) [15:26:11] (03CR) 10Arturo Borrero Gonzalez: Allow specific flows from 172.16/12 to prod (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/643269 (https://phabricator.wikimedia.org/T209082) (owner: 10Ayounsi) [15:36:48] (03PS1) 10Andrew Bogott: OpenStack haproxy: logs to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/651186 (https://phabricator.wikimedia.org/T268175) [15:48:16] (03PS1) 10Andrew Bogott: OpenStack haproxy: direct logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/651187 (https://phabricator.wikimedia.org/T268175) [15:50:14] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack haproxy: direct logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/651187 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott) [15:53:39] (03CR) 10RLazarus: "Sorry for the delay getting back to this. 
Some thoughts:" [puppet] - 10https://gerrit.wikimedia.org/r/648385 (owner: 10Dzahn) [16:07:52] (03PS3) 10Razzi: superset: Switch traffic from analytics-tool1004 to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/650522 [16:38:04] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install rdb101[12] - https://phabricator.wikimedia.org/T266724 (10RobH) a:05RobH→03Cmjohnson @Cmjohnson: rdb1011 doesn't ping via idrac interface (or ssh, or https). Please investigate rdb1011, I'll continue the steps to install rdb1012. [16:40:14] (03PS1) 10Ahmon Dancy: Localisation updates from https://translatewiki.net. [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651207 [16:42:33] as a heads up we're going to deploy some localization backports to fix some visual breakage. [16:42:58] and by "we're" I mean dancy is [16:43:23] (03CR) 10Ahmon Dancy: [C: 03+2] Localisation updates from https://translatewiki.net. [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651207 (owner: 10Ahmon Dancy) [16:43:42] (03PS1) 10RobH: rdb1012 install server updates [puppet] - 10https://gerrit.wikimedia.org/r/651197 (https://phabricator.wikimedia.org/T266724) [16:49:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install rdb101[12] - https://phabricator.wikimedia.org/T266724 (10RobH) [16:49:59] (03CR) 10RobH: [C: 03+2] rdb1012 install server updates [puppet] - 10https://gerrit.wikimedia.org/r/651197 (https://phabricator.wikimedia.org/T266724) (owner: 10RobH) [16:53:55] (03PS2) 10CRusnov: check_ripe_atlas.py: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/646879 (https://phabricator.wikimedia.org/T247364) [16:54:56] 10Operations, 10Performance-Team, 10Traffic, 10serviceops, and 2 others: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (10Aklapper) And we might face this again, see T270631 [16:55:02] 10Operations, 10Puppet, 10User-jbond: Update puppet infrastructure latest 
5.5 version - https://phabricator.wikimedia.org/T265139 (10jbond) The puppetmasters and puppetdb servers are currently running puppet 5.5.21 and packages have been uploaded to apt.w.o [16:55:35] 10Operations, 10Traffic, 10serviceops, 10Performance Issue: When logged in, loading the frwiki homepage takes a very long time - https://phabricator.wikimedia.org/T270631 (10Aklapper) [16:58:56] (03CR) 10Bstorm: etcd: make snapshot interval configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm) [17:04:52] (03PS2) 10Ahmon Dancy: Localisation updates from https://translatewiki.net. [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651207 [17:04:55] (03PS1) 10Ahmon Dancy: Localisation updates from https://translatewiki.net. [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651199 [17:04:58] (03PS1) 10Ahmon Dancy: Localisation updates from https://translatewiki.net. [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651200 [17:05:01] (03PS1) 10Ahmon Dancy: Localisation updates from https://translatewiki.net. [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651201 [17:05:04] (03PS1) 10Ahmon Dancy: Localisation updates from https://translatewiki.net. 
[core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651202 [17:06:20] (03PS4) 10Bstorm: etcd: make snapshot interval configurable [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) [17:07:36] (03PS1) 10Jgreen: nsca_frack.cfg.erb remove check_ipn_redir [puppet] - 10https://gerrit.wikimedia.org/r/651203 (https://phabricator.wikimedia.org/T270529) [17:09:02] (03CR) 10SBassett: "> Re-working scap's i18n build to not fatal if you pull it into prod config too early would be a fair bit of engineering; that'd be RelEng" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521233 (https://phabricator.wikimedia.org/T181217) (owner: 10Reedy) [17:09:27] (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm) [17:09:51] (03CR) 10Bstorm: etcd: make snapshot interval configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm) [17:10:52] (03CR) 10Giuseppe Lavagetto: [C: 03+1] etcd: make snapshot interval configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm) [17:11:34] <_joe_> bstorm: I just realized one of the two templates you modified is just a spurious copy that's committed in the wrong place [17:11:44] <_joe_> good job maintaining its consistency though :P [17:12:06] <_joe_> (I'm also confident I did commit the file to the wrong place, so sorry for the additional work) [17:12:32] <_joe_> yeah it was me, ofc [17:14:14] :) I figured best to maintain consistency for now on this patch [17:14:36] _joe_: ^ [17:14:49] <_joe_> yeah I agree [17:15:01] <_joe_> I can remove that damn file afterwards [17:16:27] (03PS1) 10Andrew Bogott: Labweb100[1-2] to Buster [puppet] - 10https://gerrit.wikimedia.org/r/651204 (https://phabricator.wikimedia.org/T269004) [17:19:05] 10Operations, 
10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install rdb101[12] - https://phabricator.wikimedia.org/T266724 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` rdb1012.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [17:19:08] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install rdb101[12] - https://phabricator.wikimedia.org/T266724 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb1012.eqiad.wmnet'] ` Of which those **FAILED**: ` ['rdb1012.eqiad.wmnet'] ` [17:19:22] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install rdb101[12] - https://phabricator.wikimedia.org/T266724 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` rdb1012.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [17:19:59] (03CR) 10Andrew Bogott: [C: 03+2] Labweb100[1-2] to Buster [puppet] - 10https://gerrit.wikimedia.org/r/651204 (https://phabricator.wikimedia.org/T269004) (owner: 10Andrew Bogott) [17:20:42] (03CR) 10Bstorm: [C: 03+2] etcd: make snapshot interval configurable [puppet] - 10https://gerrit.wikimedia.org/r/648354 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm) [17:22:48] 10Operations, 10serviceops, 10Patch-For-Review, 10cloud-services-team (Kanban): Upgrade labweb servers to buster - https://phabricator.wikimedia.org/T269004 (10Andrew) I think I'm ready to do this whenever -- I upgraded our test host and it looks just fine. 
[17:27:22] (03PS1) 10Effie Mouzeli: hiera: upgrade mc1024, mc2024 to buster [puppet] - 10https://gerrit.wikimedia.org/r/651227 (https://phabricator.wikimedia.org/T213089) [17:29:37] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install rdb101[12] - https://phabricator.wikimedia.org/T266724 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb1012.eqiad.wmnet'] ` Of which those **FAILED**: ` ['rdb1012.eqiad.wmnet'] ` [17:29:41] 10Operations, 10Puppet, 10Patch-For-Review: Puppet Proposal to remove require_package - https://phabricator.wikimedia.org/T266479 (10Majavah) [17:29:57] 10Operations, 10Traffic, 10serviceops, 10Performance Issue: When logged in, loading the frwiki homepage takes a very long time - https://phabricator.wikimedia.org/T270631 (10Legoktm) Yeah, this looks like the exact same issue as {T266865}. `Memcached error for key "WANCache:frwiki:featured-feeds:1:fr|#|v"... [17:31:16] (03CR) 10Ahmon Dancy: [C: 03+2] Localisation updates from https://translatewiki.net. [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651202 (owner: 10Ahmon Dancy) [17:31:18] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb1012.eqiad.wmnet with reason: REIMAGE [17:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:31] (03CR) 10Ahmon Dancy: [C: 03+2] Localisation updates from https://translatewiki.net. [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651201 (owner: 10Ahmon Dancy) [17:31:35] (03CR) 10Ahmon Dancy: [C: 03+2] Localisation updates from https://translatewiki.net. [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651200 (owner: 10Ahmon Dancy) [17:31:40] (03CR) 10Ahmon Dancy: [C: 03+2] Localisation updates from https://translatewiki.net. 
[core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651199 (owner: 10Ahmon Dancy) [17:32:04] (03CR) 10Bstorm: [C: 03+1] "It should be possible to check the general state of "too slow" logs in toolsbeta, cherry-pick this there and then see if there's any signi" [puppet] - 10https://gerrit.wikimedia.org/r/650470 (https://phabricator.wikimedia.org/T267966) (owner: 10Arturo Borrero Gonzalez) [17:32:17] (03CR) 10Jgreen: [C: 03+2] nsca_frack.cfg.erb remove check_ipn_redir [puppet] - 10https://gerrit.wikimedia.org/r/651203 (https://phabricator.wikimedia.org/T270529) (owner: 10Jgreen) [17:33:09] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on rdb1012.eqiad.wmnet with reason: REIMAGE [17:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:27] (03PS1) 10Ahmon Dancy: Disable PHP L10n in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651228 (https://phabricator.wikimedia.org/T270560) [17:37:11] (03CR) 10Effie Mouzeli: [C: 03+2] hiera: upgrade mc1024, mc2024 to buster [puppet] - 10https://gerrit.wikimedia.org/r/651227 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [17:38:57] effie: \o/ [17:39:07] how many left??? [17:40:31] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc1024.eqiad.wmnet ` The log can be... [17:40:43] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc2024.codfw.wmnet ` The log can be... 
[17:41:02] (03CR) 10Jforrester: [C: 03+1] Disable PHP L10n in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651228 (https://phabricator.wikimedia.org/T270560) (owner: 10Ahmon Dancy) [17:41:17] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.5 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/651231 [17:42:23] (03CR) 10Elukey: [C: 03+1] "\o/" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/651231 (owner: 10Volans) [17:45:16] (03PS1) 10Legoktm: Don't load entire feed just to output the link to it [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651209 (https://phabricator.wikimedia.org/T266900) [17:45:16] elukey: let me count :) [17:45:47] dancy: are you deploying right now? [17:46:08] elukey: 12 in total after mc1024 is done [17:46:17] nice :) [17:46:24] legoktm: Not at the moment. [17:46:30] getting stuff prepared [17:46:43] Jump in if you have something to do [17:47:13] (03CR) 10Legoktm: [C: 03+2] Don't load entire feed just to output the link to it [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651209 (https://phabricator.wikimedia.org/T266900) (owner: 10Legoktm) [17:47:23] elukey: the earliest between reimaging hosts in eqiad is ~6 hrs [17:47:59] but I reimage in pairs, so basically we have 6 more re-installations left [17:48:11] effie: yeah but let's not push this to the extreme, you are already keeping a big pace! 
[17:48:11] \o/ [17:48:50] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.5 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/651231 (owner: 10Volans) [17:49:57] (03CR) 10Thcipriani: [C: 03+2] Disable PHP L10n in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651228 (https://phabricator.wikimedia.org/T270560) (owner: 10Ahmon Dancy) [17:50:04] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.5 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/651231 (owner: 10Volans) [17:50:27] dancy: I'm behind you in the jenkins queue, but if I could jump ahead and sync out the FeaturedFeeds patch, it shouldn't take me more than a few minutes [17:50:34] elukey: yeah, I wait for a host's RX traffic to reach the levels of the other hosts before moving to the next [17:50:57] (03Merged) 10jenkins-bot: Disable PHP L10n in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651228 (https://phabricator.wikimedia.org/T270560) (owner: 10Ahmon Dancy) [17:50:58] legoktm: Works for me. Message me when you're done. 
[17:54:21] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1024.eqiad.wmnet with reason: REIMAGE [17:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:06] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2024.codfw.wmnet with reason: REIMAGE [17:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:23] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1024.eqiad.wmnet with reason: REIMAGE [17:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:49] !log legoktm@deploy1001 Synchronized /srv/mediawiki-staging/php-1.36.0-wmf.22/extensions/FeaturedFeeds/includes/FeaturedFeeds.php: Don't load entire feed just to output the link to it (T266900) (duration: 01m 01s) [17:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:53] T266900: FeaturedFeeds loads all feed content just to output the feed URLs on the main page - https://phabricator.wikimedia.org/T266900 [17:57:17] (03Merged) 10jenkins-bot: Localisation updates from https://translatewiki.net. [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651207 (owner: 10Ahmon Dancy) [17:58:27] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2024.codfw.wmnet with reason: REIMAGE [17:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:52] dancy: done, thanks [17:59:02] 👍🏾 [17:59:57] (03Merged) 10jenkins-bot: Localisation updates from https://translatewiki.net. [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651199 (owner: 10Ahmon Dancy) [18:00:02] (03Merged) 10jenkins-bot: Localisation updates from https://translatewiki.net. 
[core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651200 (owner: 10Ahmon Dancy) [18:00:09] (03Merged) 10jenkins-bot: Localisation updates from https://translatewiki.net. [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651201 (owner: 10Ahmon Dancy) [18:00:14] (03Merged) 10jenkins-bot: Localisation updates from https://translatewiki.net. [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651202 (owner: 10Ahmon Dancy) [18:00:16] (03PS1) 10RobH: rdb1011 install params [puppet] - 10https://gerrit.wikimedia.org/r/651232 (https://phabricator.wikimedia.org/T266724) [18:00:21] (03CR) 10Bstorm: "This one is still quite necessary:" [puppet] - 10https://gerrit.wikimedia.org/r/650280 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm) [18:00:51] dancy: Confirmed fixed on Beta Cluster, FWIW. [18:00:58] (03PS2) 10RobH: rdb1011 install params [puppet] - 10https://gerrit.wikimedia.org/r/651232 (https://phabricator.wikimedia.org/T266724) [18:01:22] James_F: Excellent. Thanks for the notification. I'll discuss the remaining issues with Tyler in January [18:01:29] Cool. :-) [18:01:30] (03CR) 10RobH: [C: 03+2] rdb1011 install params [puppet] - 10https://gerrit.wikimedia.org/r/651232 (https://phabricator.wikimedia.org/T266724) (owner: 10RobH) [18:01:37] 10Operations, 10Performance-Team, 10Traffic, 10serviceops, and 2 others: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (10Legoktm) [18:03:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install rdb101[12] - https://phabricator.wikimedia.org/T266724 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['rdb1011.eqiad.wmnet', 'rdb1012.eqiad.wmnet'] ` The lo... 
[18:05:34] (03CR) 10Bstorm: [C: 03+2] kubeadm and paws: tuning options for stacked control plane [puppet] - 10https://gerrit.wikimedia.org/r/650280 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm) [18:05:45] (03PS1) 10Volans: Upstream release v0.0.5 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/651233 [18:06:42] 10Operations, 10Traffic, 10serviceops, 10Performance Issue: When logged in, loading the frwiki homepage takes a very long time - https://phabricator.wikimedia.org/T270631 (10Legoktm) >>! In T270631#6706192, @Legoktm wrote: > one of the frwiki feeds is too big to fit in memcache. Either we need to figure ou... [18:06:45] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install rdb101[12] - https://phabricator.wikimedia.org/T266724 (10RobH) [18:07:14] Starting deploy for T270619 fixes [18:07:14] T270619: Consider two core backports to 1.36.0-wmf.22 (for now untranslated `User contributions` on Special:Contributions) - https://phabricator.wikimedia.org/T270619 [18:08:48] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.5 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/651233 (owner: 10Volans) [18:09:32] !log dancy@deploy1001 Started scap: Backport of l10n changes for T270619 [18:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:03] (03Merged) 10jenkins-bot: Upstream release v0.0.5 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/651233 (owner: 10Volans) [18:14:35] !log uploaded python3-wmflib_0.0.5 to apt.wikimedia.org buster-wikimedia [18:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:48] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb1012.eqiad.wmnet with reason: REIMAGE [18:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:13] Thanks dancy & thcipriani [18:18:27] No problem! 
[18:18:41] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2024.codfw.wmnet'] ` and were **ALL** successful. [18:18:45] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb1012.eqiad.wmnet with reason: REIMAGE [18:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:46] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1024.eqiad.wmnet'] ` and were **ALL** successful. [18:25:12] (03PS1) 10CDanis: Pre-review: initial commit [software/klaxon] - 10https://gerrit.wikimedia.org/r/651238 [18:25:54] (03CR) 10CDanis: "As discussed, this is a pre-review -- still a lot of polishing needed :)" [software/klaxon] - 10https://gerrit.wikimedia.org/r/651238 (owner: 10CDanis) [18:26:09] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb1011.eqiad.wmnet with reason: REIMAGE [18:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:13] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb1011.eqiad.wmnet with reason: REIMAGE [18:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:52] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install rdb101[12] - https://phabricator.wikimedia.org/T266724 (10RobH) a:05Cmjohnson→03RobH [18:29:48] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install rdb101[12] - https://phabricator.wikimedia.org/T266724 (10RobH) [18:30:38] !log dancy@deploy1001 Finished scap: Backport of l10n changes for T270619 (duration: 21m 12s) [18:30:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:41] T270619: Consider two core backports to 1.36.0-wmf.22 (for now untranslated `User contributions` on Special:Contributions) - https://phabricator.wikimedia.org/T270619 [18:34:10] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install rdb101[12] - https://phabricator.wikimedia.org/T266724 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb1012.eqiad.wmnet', 'rdb1011.eqiad.wmnet'] ` and were **ALL** successful. [18:34:42] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install rdb101[12] - https://phabricator.wikimedia.org/T266724 (10RobH) 05Open→03Resolved @jijiki: I've added you as a subscriber so you get a notification of this completed racking for two hosts; rdb101[12] are all done with their imaging a... [18:35:35] (03PS2) 10CDanis: Pre-review: initial commit [software/klaxon] - 10https://gerrit.wikimedia.org/r/651238 [18:36:29] (03PS3) 10CDanis: Pre-review: initial commit [software/klaxon] - 10https://gerrit.wikimedia.org/r/651238 [18:42:47] (03PS1) 10Legoktm: noc: Fix "Currently active MediaWiki versions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651241 (https://phabricator.wikimedia.org/T235338) [18:45:33] 10Operations, 10Traffic, 10serviceops, 10Performance Issue: When logged in, loading the frwiki homepage takes a very long time - https://phabricator.wikimedia.org/T270631 (10Legoktm) p:05Triage→03High [18:45:59] (03CR) 10Jforrester: "Possibly broken by the scap modules being moved out of the local repo and into scap proper?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/651241 (https://phabricator.wikimedia.org/T235338) (owner: 10Legoktm) [18:47:09] (03PS4) 10Razzi: role::analytics_cluster::ui::dashboards: Add superset to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/650179 (https://phabricator.wikimedia.org/T268219) [18:51:15] (03CR) 10Reedy: [C: 03+1] "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651241 (https://phabricator.wikimedia.org/T235338) (owner: 10Legoktm) [19:02:28] (03PS2) 10Legoktm: aptrepo: Add thirdparty/pyall component for buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/650010 (https://phabricator.wikimedia.org/T241195) [19:03:25] (03CR) 10Legoktm: [C: 03+2] aptrepo: Add thirdparty/pyall component for buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/650010 (https://phabricator.wikimedia.org/T241195) (owner: 10Legoktm) [19:04:18] (03PS1) 10Andrew Bogott: haproxy: add a note about the $logging param [puppet] - 10https://gerrit.wikimedia.org/r/651246 [19:05:48] (03PS1) 10Bstorm: kubeadm: correct spacing on the stacked control plane options [puppet] - 10https://gerrit.wikimedia.org/r/651247 (https://phabricator.wikimedia.org/T267966) [19:11:18] (03PS1) 10Elukey: Set a more restrictive umask for Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/651249 (https://phabricator.wikimedia.org/T270629) [19:12:56] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27225/console" [puppet] - 10https://gerrit.wikimedia.org/r/651249 (https://phabricator.wikimedia.org/T270629) (owner: 10Elukey) [19:13:18] (03CR) 10Andrew Bogott: "@Gilles, I'm cc'ing you on this because I believe that Thumbor is the only project that currently has logging enabled. 
If you have securi" [puppet] - 10https://gerrit.wikimedia.org/r/651246 (owner: 10Andrew Bogott) [19:15:03] RECOVERY - WDQS high update lag on wdqs1011 is OK: (C)3600 ge (W)1200 ge 588.5 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:15:24] (03PS1) 10Legoktm: aptrepo: Fix name of Update for pyall [puppet] - 10https://gerrit.wikimedia.org/r/651251 [19:16:10] (03CR) 10Legoktm: [C: 03+2] aptrepo: Fix name of Update for pyall [puppet] - 10https://gerrit.wikimedia.org/r/651251 (owner: 10Legoktm) [19:16:46] (03CR) 10Elukey: [V: 03+1] "This is just to test, to see if it can give us some hints about real use case scenarios.." [puppet] - 10https://gerrit.wikimedia.org/r/651249 (https://phabricator.wikimedia.org/T270629) (owner: 10Elukey) [19:20:37] (03PS1) 10Andrew Bogott: Direct more OpenStack logs to kafka/kibana [puppet] - 10https://gerrit.wikimedia.org/r/651252 (https://phabricator.wikimedia.org/T268175) [19:22:36] 10Operations, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle) We probably don't and can't support backups (restoring outdated data would break business logic),... 
[19:24:18] (03CR) 10Andrew Bogott: [C: 03+2] Direct more OpenStack logs to kafka/kibana [puppet] - 10https://gerrit.wikimedia.org/r/651252 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott) [19:28:10] (03CR) 10Thcipriani: [C: 03+1] "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651241 (https://phabricator.wikimedia.org/T235338) (owner: 10Legoktm) [19:29:46] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27226/console" [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) (owner: 10Razzi) [19:30:02] (03CR) 10Elukey: [C: 03+1] sqoop: Ensure /tmp/sqoop-jars/ is present [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) (owner: 10Razzi) [19:30:40] (03CR) 10Elukey: [C: 03+1] "Looks good, let's also have somebody from the Traffic team to validate :)" [puppet] - 10https://gerrit.wikimedia.org/r/650522 (owner: 10Razzi) [19:33:44] (03CR) 10Elukey: [C: 03+1] role::analytics_cluster::ui::dashboards: Add superset to an-tool1010 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/650179 (https://phabricator.wikimedia.org/T268219) (owner: 10Razzi) [19:34:10] (03CR) 10Elukey: [C: 03+1] "Nice thanks a lot!" 
[labs/private] - 10https://gerrit.wikimedia.org/r/650275 (owner: 10Razzi) [19:36:43] (03PS24) 10CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [19:36:50] (03PS4) 10CRusnov: modules/icinga/files/raid_handler.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/647369 (https://phabricator.wikimedia.org/T247364) [19:36:57] (03PS3) 10CRusnov: check_ripe_atlas.py: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/646879 (https://phabricator.wikimedia.org/T247364) [19:37:04] (03PS2) 10CRusnov: Port elasticsearch/es-tool.py to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/644591 (https://phabricator.wikimedia.org/T247364) [19:37:11] (03PS3) 10CRusnov: modules/tcpircbot: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/628436 (https://phabricator.wikimedia.org/T247364) [19:45:21] (03Abandoned) 10Andrew Bogott: OpenStack haproxy: logs to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/651186 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott) [19:47:07] !log repool wdqs1011 [19:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:34] (03CR) 10Urbanecm: [C: 03+1] "FTR: I talked about this with Joe from T&S, and he's happy about the change." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/650988 (https://phabricator.wikimedia.org/T180896) (owner: 10Urbanecm) [19:55:14] (03CR) 10Razzi: [C: 03+2] sqoop: Ensure /tmp/sqoop-jars/ is present [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) (owner: 10Razzi) [19:57:55] (03PS1) 10Andrew Bogott: Designate: filter health checks from Kibana logs [puppet] - 10https://gerrit.wikimedia.org/r/651254 (https://phabricator.wikimedia.org/T268175) [19:58:36] (03CR) 10Bstorm: [C: 03+2] kubeadm: correct spacing on the stacked control plane options [puppet] - 10https://gerrit.wikimedia.org/r/651247 (https://phabricator.wikimedia.org/T267966) (owner: 10Bstorm) [20:01:07] (03CR) 10Andrew Bogott: [C: 03+2] Designate: filter health checks from Kibana logs [puppet] - 10https://gerrit.wikimedia.org/r/651254 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott) [20:01:54] (03PS1) 10Jgiannelos: wikifeeds: bump to 2020-12-21-180823-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/651255 [20:13:47] (03CR) 10CDanis: [C: 03+1] superset: Switch traffic from analytics-tool1004 to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/650522 (owner: 10Razzi) [20:15:52] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: bump to 2020-12-21-180823-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/651255 (owner: 10Jgiannelos) [20:17:31] (03Merged) 10jenkins-bot: wikifeeds: bump to 2020-12-21-180823-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/651255 (owner: 10Jgiannelos) [20:23:24] (03PS1) 10Andrew Bogott: Neutron: change use_syslog from True to true [puppet] - 10https://gerrit.wikimedia.org/r/651256 (https://phabricator.wikimedia.org/T268175) [20:25:08] (03CR) 10Andrew Bogott: [C: 03+2] Neutron: change use_syslog from True to true [puppet] - 10https://gerrit.wikimedia.org/r/651256 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott) [20:28:04] (03PS1) 10Thcipriani: 
DNM: change test [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651257 [20:34:43] (03PS4) 10CDanis: Pre-review: initial commit [software/klaxon] - 10https://gerrit.wikimedia.org/r/651238 [20:44:08] (03Abandoned) 10Thcipriani: DNM: change test [core] (wmf/1.36.0-wmf.22) - 10https://gerrit.wikimedia.org/r/651257 (owner: 10Thcipriani) [20:46:10] (03PS1) 10Andrew Bogott: Designate: further syslog filters for designate-api [puppet] - 10https://gerrit.wikimedia.org/r/651259 (https://phabricator.wikimedia.org/T268175) [20:51:09] (03CR) 10Andrew Bogott: [C: 03+2] Designate: further syslog filters for designate-api [puppet] - 10https://gerrit.wikimedia.org/r/651259 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott) [20:52:15] (03CR) 10CDanis: [C: 03+1] modules/tcpircbot: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/628436 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [20:58:31] (03PS5) 10CDanis: Pre-review: initial commit [software/klaxon] - 10https://gerrit.wikimedia.org/r/651238 [21:07:57] (03CR) 10CRusnov: [C: 03+2] netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [21:09:43] !log merging change 643354 for Netbox 2.9 support, puppet disabled on production machines until testing completed T266487 [21:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:47] T266487: Upgrade Netbox and accompanying to the 2.9 series (Tracking Task) - https://phabricator.wikimedia.org/T266487 [21:16:28] (03CR) 10Bstorm: Openstack galera cluster: move data to /srv (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/651012 (https://phabricator.wikimedia.org/T270552) (owner: 10Andrew Bogott) [21:19:01] (03PS2) 10Andrew Bogott: Openstack galera cluster: move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/651012 (https://phabricator.wikimedia.org/T270552) 
[21:19:05] (03CR) 10CRusnov: [V: 03+2 C: 03+2] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/651268 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [21:19:08] (03CR) 10jenkins-bot: [V: 04-1] netbox: Fix dependency loop introduced in previous patch [puppet] - 10https://gerrit.wikimedia.org/r/651268 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [21:19:31] (03PS1) 10CDanis: dbctl: README: document section 'flavor' [software/conftool] - 10https://gerrit.wikimedia.org/r/651269 (https://phabricator.wikimedia.org/T269324) [21:19:39] (03CR) 10Ori.livneh: "If this looks good, could you please deploy it?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651267 (owner: 10Ori.livneh) [21:20:28] (03PS2) 10CRusnov: netbox: Fix dependency loop introduced in previous patch [puppet] - 10https://gerrit.wikimedia.org/r/651268 (https://phabricator.wikimedia.org/T266487) [21:21:05] (03CR) 10Andrew Bogott: [C: 03+2] Openstack galera cluster: move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/651012 (https://phabricator.wikimedia.org/T270552) (owner: 10Andrew Bogott) [21:21:53] (03CR) 10Bstorm: [C: 03+1] Openstack galera cluster: move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/651012 (https://phabricator.wikimedia.org/T270552) (owner: 10Andrew Bogott) [21:22:13] (03PS3) 10CRusnov: netbox: Fix dependency loop introduced in previous patch [puppet] - 10https://gerrit.wikimedia.org/r/651268 (https://phabricator.wikimedia.org/T266487) [21:23:13] (03CR) 10CRusnov: [C: 03+2] netbox: Fix dependency loop introduced in previous patch [puppet] - 10https://gerrit.wikimedia.org/r/651268 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [21:27:26] (03CR) 10Volans: [C: 03+1] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/651269 (https://phabricator.wikimedia.org/T269324) (owner: 10CDanis) [21:29:40] (03CR) 10CRusnov: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/651270 (owner: 10CRusnov) [21:29:48] (03CR) 10CRusnov: [C: 03+2] netbox: Fix typo in internal redis configuration [puppet] - 10https://gerrit.wikimedia.org/r/651270 (owner: 10CRusnov) [21:42:04] !log manually imported debs to buster-wikimedia thirdparty/pyall component (T241195) [21:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:08] T241195: Add python3.8 to buster-wikimedia pyall component - https://phabricator.wikimedia.org/T241195 [21:53:33] (03CR) 10Dave Pifke: [C: 03+1] "LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651267 (owner: 10Ori.livneh) [21:53:48] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/651274 (owner: 10CRusnov) [21:54:24] (03CR) 10CRusnov: [C: 03+2] netbox: Fix typos in configuration.py.erb [puppet] - 10https://gerrit.wikimedia.org/r/651274 (owner: 10CRusnov) [21:55:21] 10Operations, 10serviceops: Add python3.8 to buster-wikimedia pyall component - https://phabricator.wikimedia.org/T241195 (10Legoktm) 05Open→03Resolved a:05MoritzMuehlenhoff→03Legoktm During the import I ran into: {P13619} Moritz said this has happened with Jenkins before, and it seems like something... [21:58:33] 10Operations, 10MW-on-K8s, 10Shellbox, 10serviceops: Decide on logging in k8s for ShellBox - https://phabricator.wikimedia.org/T263545 (10Legoktm) [21:58:42] 10Operations, 10MW-on-K8s, 10Shellbox, 10serviceops, and 3 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10Legoktm) [21:58:44] 10Operations, 10MW-on-K8s, 10Shellbox, 10serviceops, 10Platform Team Workboards (Purple): Make Shellbox actually do streaming - https://phabricator.wikimedia.org/T268427 (10Legoktm) [22:03:59] (03PS6) 10CDanis: Pre-review: initial commit [software/klaxon] - 10https://gerrit.wikimedia.org/r/651238 [22:13:45] Hey all - I'd like to try to deploy the security patch for T270453. 
If it doesn't test well, I'll take it off. [22:14:20] (03PS7) 10CDanis: Pre-review: initial commit [software/klaxon] - 10https://gerrit.wikimedia.org/r/651238 [22:18:29] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) @elukey that will work, I will add 1 to B4 and 2 to C2. Thanks! [22:18:40] !log Re-enabling puppet on Netbox production instances after having tested netbox2001 with new puppet code T266487 [22:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:44] T266487: Upgrade Netbox and accompanying to the 2.9 series (Tracking Task) - https://phabricator.wikimedia.org/T266487 [22:25:19] !log crusnov@deploy1001 Started deploy [netbox/deploy@0362a12]: Deploy of 2.9.10 to netbox-dev for script testing [22:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:57] !log Deployed security patch T270453 [22:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:20] !log crusnov@deploy1001 Finished deploy [netbox/deploy@0362a12]: Deploy of 2.9.10 to netbox-dev for script testing (duration: 01m 01s) [22:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:25] !log crusnov@deploy1001 Started deploy [netbox/deploy@0362a12]: Deploy of 2.9.10 to netbox-dev for script testing p2 [22:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:30] !log crusnov@deploy1001 Finished deploy [netbox/deploy@0362a12]: Deploy of 2.9.10 to netbox-dev for script testing p2 (duration: 00m 05s) [22:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:36] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Epic, 10Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (10jcrespo) [22:31:43] 
10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal, 10Patch-For-Review: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (10jcrespo) 05Open→03Resolved a:03jcrespo For a proof of concept,... [22:32:25] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Epic, 10Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (10jcrespo) [22:32:27] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal, 10Patch-For-Review: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (10jcrespo) [22:32:45] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage: Depool an entire swift cluster for a datacenter and do performance testing of batch downloads of wiki media (querying swift and/or MediaWiki) - https://phabricator.wikimedia.org/T267338 (10jcrespo) [22:32:47] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal, 10Patch-For-Review: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (10jcrespo) [22:32:49] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Epic, 10Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (10jcrespo) [22:33:59] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal: Research storage solutions for media backups - https://phabricator.wikimedia.org/T264190 (10jcrespo) 05Open→03Resolved a:03jcrespo Research was done and reflected at private document https://docs.google.com/document/d/1kmaDIrae4HsE... 
[22:34:02] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Epic, 10Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (10jcrespo) [22:37:08] 10Operations, 10Data-Persistence-Backup, 10Goal: Define a methodology to track WMF services backup requirements - https://phabricator.wikimedia.org/T264274 (10jcrespo) Some work was done on this at https://docs.google.com/spreadsheets/d/1aAo8COkz3_P3NS73i-ZZXu0gocx1J6mlzA79Drwo8CA but needs more validation o... [22:38:30] 10Operations, 10Data-Persistence-Backup, 10Goal: Track all directly-owned SRE datasets into the new inventory system - https://phabricator.wikimedia.org/T264275 (10jcrespo) Data-persistence owned or known were tracked at: https://docs.google.com/spreadsheets/d/1aAo8COkz3_P3NS73i-ZZXu0gocx1J6mlzA79Drwo8CA but... [22:40:11] PROBLEM - Check size of conntrack table on ncredir3001 is CRITICAL: CRITICAL: nf_conntrack is 100 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:41:51] RECOVERY - Check size of conntrack table on ncredir3001 is OK: OK: nf_conntrack is 1 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:42:47] (03CR) 10Bstorm: [C: 03+1] passwords: Add legoktm to Cloud VPS root [labs/private] - 10https://gerrit.wikimedia.org/r/650008 (owner: 10Legoktm) [23:00:03] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/651280 (owner: 10CRusnov) [23:04:25] 10Operations, 10MW-on-K8s, 10serviceops, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Pipeline): Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (10Legoktm) >>! In T261369#6695002, @akosiaris wrote: >>>! In T261369#6693707, @Krinkle wrote: >> Some... 
[23:10:55] (03PS8) 10CDanis: Pre-review: initial commit [software/klaxon] - 10https://gerrit.wikimedia.org/r/651238 [23:11:43] jouncebot: next [23:11:43] No deployments scheduled for the foreseeable future! [23:12:44] (03PS2) 10CRusnov: Rebuild Netbox 2.9.10 dependencies [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/651280 [23:13:23] (03PS9) 10CDanis: Pre-review: initial commit [software/klaxon] - 10https://gerrit.wikimedia.org/r/651238 [23:13:25] (03PS2) 10Legoktm: passwords: Add legoktm to Cloud VPS root [labs/private] - 10https://gerrit.wikimedia.org/r/650008 [23:13:32] (03CR) 10Legoktm: [V: 03+2 C: 03+2] passwords: Add legoktm to Cloud VPS root [labs/private] - 10https://gerrit.wikimedia.org/r/650008 (owner: 10Legoktm) [23:16:43] (03CR) 10Legoktm: [C: 03+2] noc: Fix "Currently active MediaWiki versions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651241 (https://phabricator.wikimedia.org/T235338) (owner: 10Legoktm) [23:17:38] (03Merged) 10jenkins-bot: noc: Fix "Currently active MediaWiki versions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651241 (https://phabricator.wikimedia.org/T235338) (owner: 10Legoktm) [23:20:31] !log legoktm@deploy1001 Synchronized docroot/noc/conf/index.php: noc: Fix "Currently active MediaWiki versions" (T235338) (duration: 00m 54s) [23:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:35] T235338: "Currently active MediaWiki versions:" broken on noc/conf - https://phabricator.wikimedia.org/T235338 [23:20:44] https://noc.wikimedia.org/conf/ it's magic [23:21:17] nice fix legoktm :) [23:22:17] 10Operations, 10Scap, 10Wikimedia-General-or-Unknown, 10serviceops, and 2 others: "Currently active MediaWiki versions:" broken on noc/conf - https://phabricator.wikimedia.org/T235338 (10Legoktm) 05Open→03Resolved a:05jijiki→03Legoktm [23:22:28] ty :D [23:33:09] hey ppl! 
question: is there any difference in ssh to Cloud instances that have a web proxy pointing to wmflabs rather than wmcloud? I've created a new instance and I can't log in [23:41:47] no, but the internal domain names have changed now [23:42:08] what's your project / the name of the instance? [23:42:14] dsaez: ^ [23:42:44] wikipediaWikidata.wmf-research-tools.eqiad.wmflabs legoktm [23:43:31] so if you look at https://openstack-browser.toolforge.org/project/wmf-research-tools you can see the internal name for the instance is now wikipediaWikidata.wmf-research-tools.eqiad1.wikimedia.cloud [23:43:47] so you'll want to ssh into that (and you might need to make some adjustments to your .ssh/config) [23:44:54] ooh... i see [23:45:03] thanks, let me try [23:45:58] https://lists.wikimedia.org/pipermail/cloud-announce/2020-September/000311.html has all the details [23:46:01] legoktm, it works!!! [23:46:08] thanks, you saved my life [23:46:08] (I'd also suggest subscribing to that mailing list for future announcements) [23:46:09] woot [23:46:11] :))) [23:46:26] I had already updated the config [23:46:38] but I didn't know how to check the domain from the horizon interface [23:57:27] legoktm, sorry for bothering, but when I check wikidataWikipedia.wmcloud.org I get "No proxy is configured for this host name." but the web proxy is configured [23:58:32] hm [23:59:00] I wonder whether the uppercase letter is throwing it off [23:59:17] I'd suggest asking in #wikimedia-cloud