[00:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200116T0000).
[00:00:04] RoanKattouw: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:09:00] RoanKattouw_: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/564183 depends on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/565154 which you didn't list. Did you want both?
[00:09:01] I will do my own SWAT
[00:09:10] Oh did I list them backwards?
[00:09:24] I meant to only list 565154
[00:09:52] Edited wiki page to fix
[00:10:02] Kk.
[00:10:17] 564183 is for tomorrow (already listed there)
[00:10:24] James_F|Away: Are you deploying or shall I?
[00:11:28] You can. I'm going home. :-)
[00:11:44] OK will do
[00:12:10] (PS1) Reedy: Revert "Optimise logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/565157
[00:12:11] (CR) Catrope: [C: +2] GrowthExperiments: Enable topics for suggested edits on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/565154 (owner: Catrope)
[00:12:45] (PS2) Reedy: Revert "Optimise logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/565157
[00:13:21] (Merged) jenkins-bot: GrowthExperiments: Enable topics for suggested edits on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/565154 (owner: Catrope)
[00:14:41] (PS3) Reedy: Restore pre censorship trwiki logos [mediawiki-config] - https://gerrit.wikimedia.org/r/565157 (https://phabricator.wikimedia.org/T242932)
[00:16:56] RoanKattouw_: ^ Wanna do trwiki a favour too? :)
[00:17:04] Sure!
[00:17:51] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: GrowthExperiments: Enable topics for suggested edits on testwiki (duration: 01m 04s)
[00:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:53] (CR) Reedy: "Original images are already optimised according to `optipng -o7`" [mediawiki-config] - https://gerrit.wikimedia.org/r/565157 (https://phabricator.wikimedia.org/T242932) (owner: Reedy)
[00:17:59] (PS4) Catrope: Restore pre censorship trwiki logos [mediawiki-config] - https://gerrit.wikimedia.org/r/565157 (https://phabricator.wikimedia.org/T242932) (owner: Reedy)
[00:18:04] (CR) Catrope: [C: +2] Restore pre censorship trwiki logos [mediawiki-config] - https://gerrit.wikimedia.org/r/565157 (https://phabricator.wikimedia.org/T242932) (owner: Reedy)
[00:18:11] (Abandoned) Reedy: Revert "Provide a temporary trwiki logo marking two years of censorship" [mediawiki-config] - https://gerrit.wikimedia.org/r/506903 (owner: Jforrester)
[00:19:09] (Merged) jenkins-bot: Restore pre censorship trwiki logos [mediawiki-config] - https://gerrit.wikimedia.org/r/565157 (https://phabricator.wikimedia.org/T242932) (owner: Reedy)
[00:20:30] Yay IRCCloud is back
[00:23:10] !log catrope@deploy1001 Synchronized static/images/project-logos/: Restore pre-censorship trwiki logos (T242932) (duration: 01m 05s)
[00:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:13] T242932: Remove censorship bar from Turkish Wikipedia logo - https://phabricator.wikimedia.org/T242932
[00:26:40] thanks
[00:36:06] (PS3) Dzahn: devtools: add Hiera values for a deployment_server in cloud [puppet] - https://gerrit.wikimedia.org/r/563618
[00:36:28] (CR) jerkins-bot: [V: -1] devtools: add Hiera values for a deployment_server in cloud [puppet] - https://gerrit.wikimedia.org/r/563618 (owner: Dzahn)
[00:37:06] (PS4) Dzahn: devtools: add Hiera values
for a deployment_server in cloud [puppet] - https://gerrit.wikimedia.org/r/563618
[00:40:41] !log set max_connections on db1133 (m5-master) back to 500 since the neutron connections seem fairly stable now T242817
[00:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:44] T242817: m5 ran out of connections after openstack upgrade to "Pike" - https://phabricator.wikimedia.org/T242817
[00:41:07] (PS1) CRusnov: puppetdb report: Quick fix to blacklist persistent test VM [software/netbox-extras] - https://gerrit.wikimedia.org/r/565159
[00:41:39] (CR) CRusnov: "This change is ready for review." [software/netbox-extras] - https://gerrit.wikimedia.org/r/565159 (owner: CRusnov)
[00:44:23] (CR) Dzahn: [C: +2] devtools: add Hiera values for a deployment_server in cloud [puppet] - https://gerrit.wikimedia.org/r/563618 (owner: Dzahn)
[01:00:01] (CR) Holger Knust: "Where is the config.yaml supposed to be located? I am little fuzzy on how we want to inject the file." [deployment-charts] - https://gerrit.wikimedia.org/r/554576 (owner: Holger Knust)
[01:00:04] twentyafterfour: How many deployers does it take to do Phabricator update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200116T0100).
[01:04:51] (PS1) Dzahn: devtools (cloud): change PHP version from 7.3 to 7.2 [puppet] - https://gerrit.wikimedia.org/r/565160
[01:05:10] (CR) Dzahn: [C: +2] devtools (cloud): change PHP version from 7.3 to 7.2 [puppet] - https://gerrit.wikimedia.org/r/565160 (owner: Dzahn)
[01:09:29] (CR) Subramanya Sastry: "> Patch Set 2:" [mediawiki-config] - https://gerrit.wikimedia.org/r/564805 (https://phabricator.wikimedia.org/T239806) (owner: Arlolra)
[01:14:17] (CR) Ppchelko: "> Patch Set 4:" [deployment-charts] - https://gerrit.wikimedia.org/r/554576 (owner: Holger Knust)
[01:14:27] (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/565159 (owner: CRusnov)
[01:19:05] (CR) Ppchelko: "If you mean where should it be in the fs of the container, we put the configs under `/etc//config.yaml`" [deployment-charts] - https://gerrit.wikimedia.org/r/554576 (owner: Holger Knust)
[01:53:38] (PS1) Legoktm: Consistently capitalize MediaWiki properly [puppet] - https://gerrit.wikimedia.org/r/565166
[01:53:57] (PS4) Mstyles: A/B test for MLR models [mediawiki-config] - https://gerrit.wikimedia.org/r/559614 (https://phabricator.wikimedia.org/T219534)
[01:54:29] (CR) Legoktm: Consistently capitalize MediaWiki properly (1 comment) [puppet] - https://gerrit.wikimedia.org/r/565166 (owner: Legoktm)
[02:35:18] !log krinkle@mwmaint1002 Change code_repo.repo_viewvc from 'https://svn.wikimedia.org/viewvc/mediawiki' to '' for 'MediaWiki' repo_name. Ref 2162cf2fc46cfe, T205361.
[02:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:30] T205361: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361
[02:37:38] Krinkle: what does that do?
[02:38:12] (CR) CRusnov: [C: +2] puppetdb report: Quick fix to blacklist persistent test VM [software/netbox-extras] - https://gerrit.wikimedia.org/r/565159 (owner: CRusnov)
[02:43:59] (PS1) CRusnov: puppetdb: Fix pre-existing error which now fails with blacklist [software/netbox-extras] - https://gerrit.wikimedia.org/r/565168
[02:44:20] (CR) CRusnov: [V: +2 C: +2] "This change is ready for review." [software/netbox-extras] - https://gerrit.wikimedia.org/r/565168 (owner: CRusnov)
[02:44:22] (CR) jerkins-bot: [V: -1] puppetdb: Fix pre-existing error which now fails with blacklist [software/netbox-extras] - https://gerrit.wikimedia.org/r/565168 (owner: CRusnov)
[02:45:06] (PS1) CRusnov: puppetdb: fix pep8 complaint [software/netbox-extras] - https://gerrit.wikimedia.org/r/565169
[02:45:29] (CR) jerkins-bot: [V: -1] puppetdb: fix pep8 complaint [software/netbox-extras] - https://gerrit.wikimedia.org/r/565169 (owner: CRusnov)
[02:45:41] legoktm: https://phabricator.wikimedia.org/T205361#5808066
[02:46:17] !log krinkle@mwmaint1002 Change code_repo.repo_viewvc from 'http://svn.wikimedia.org/viewvc/pywikipedia' to '' for repo_id 2 (pywikipedia) for. Ref 2162cf2fc46cfe.
[02:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:46:24] ty :)
[02:46:28] will work on that tonight
[02:47:00] There is a UI for it at Special:RepoAdmin but we've disabled that in prod. Anyway, all good now.
[02:47:04] cool
[02:47:09] I'll log off the webchat now
[03:17:28] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:48] ^ known fixing
[03:29:58] (PS2) CRusnov: puppetdb: Fix pre-existing error which now fails with blacklist [software/netbox-extras] - https://gerrit.wikimedia.org/r/565168
[03:31:26] (CR) CRusnov: [C: +2] puppetdb: Fix pre-existing error which now fails with blacklist [software/netbox-extras] - https://gerrit.wikimedia.org/r/565168 (owner: CRusnov)
[03:31:42] (CR) CRusnov: "self merging because netbox is brokne right now." [software/netbox-extras] - https://gerrit.wikimedia.org/r/565168 (owner: CRusnov)
[03:35:02] okay fixed
[03:37:40] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:19] (PS1) CRusnov: puppetdb: Improve structure and separate VMs and devices [software/netbox-extras] - https://gerrit.wikimedia.org/r/565176
[04:15:42] (CR) jerkins-bot: [V: -1] puppetdb: Improve structure and separate VMs and devices [software/netbox-extras] - https://gerrit.wikimedia.org/r/565176 (owner: CRusnov)
[06:55:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1080', diff saved to https://phabricator.wikimedia.org/P10166 and previous config saved to /var/cache/conftool/dbconfig/20200116-065505-marostegui.json
[06:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:14] !log stop db1107 and db1080 replication in sync
[06:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110', diff saved to https://phabricator.wikimedia.org/P10168 and previous config saved to /var/cache/conftool/dbconfig/20200116-072219-marostegui.json
[07:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:37] !log Upgrade db1110
[07:22:38] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:03] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[07:30:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1110', diff saved to https://phabricator.wikimedia.org/P10169 and previous config saved to /var/cache/conftool/dbconfig/20200116-073012-marostegui.json
[07:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:24] Operations, Quality-and-Test-Engineering-Team (QTE), Wikimedia-Mailing-lists, User-zeljkofilipin: Close QA mailing list - https://phabricator.wikimedia.org/T237383 (akosiaris) There is no such action as "close the list" in mailman. What we can do is delete it, with an extra question of whether yo...
[07:44:42] (CR) Giuseppe Lavagetto: [C: +1] "The results from the compiler show some differences, but upon further analysis, they are only naming differences and a few added checks du" [puppet] - https://gerrit.wikimedia.org/r/564690 (owner: Giuseppe Lavagetto)
[07:56:52] (CR) Alexandros Kosiaris: [C: +1] "+1, but you may want to add "Mediawiki" in the typos file to avoid having to go through this again (and shift the burden to each patch own" [puppet] - https://gerrit.wikimedia.org/r/565166 (owner: Legoktm)
[08:01:04] (CR) Giuseppe Lavagetto: wmflib: Introduce a more usable data structure to describe services.
(8 comments) [puppet] - https://gerrit.wikimedia.org/r/558620 (owner: Giuseppe Lavagetto)
[08:02:34] Operations, Scap, serviceops: Make canary wait time configurable - https://phabricator.wikimedia.org/T217924 (jijiki)
[08:03:36] Operations, Release-Engineering-Team, serviceops: Hundreds of tags for `wikimedia/mediawiki-core` image - https://phabricator.wikimedia.org/T242775 (Joe) a:Joe
[08:04:15] Operations, Release-Engineering-Team, serviceops: Hundreds of tags for `wikimedia/mediawiki-core` image - https://phabricator.wikimedia.org/T242775 (Joe) The total number of images present on the registry is 1003. I'm going to slowly remove most of the old ones in the coming week.
[08:05:52] <_joe_> !log deleting mediawiki-core docker images from september 2019 from the registry, T242775
[08:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:55] T242775: Hundreds of tags for `wikimedia/mediawiki-core` image - https://phabricator.wikimedia.org/T242775
[08:17:24] (PS7) Joal: Add mediawiki-history-dumps rsync to labstore [puppet] - https://gerrit.wikimedia.org/r/564066
[08:20:38] !log reject RPKI invalids in eqord/eqiad - T220669
[08:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:41] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669
[08:27:05] !log installing OpenSSL security updates on Parsoid hosts
[08:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:42] (CR) ArielGlenn: [C: +1] "Thumbs up for the dump-related manifests."
[puppet] - 10https://gerrit.wikimedia.org/r/565166 (owner: 10Legoktm) [08:34:10] (03PS1) 10Marostegui: site.pp: Add two comments [puppet] - 10https://gerrit.wikimedia.org/r/565228 [08:35:45] (03CR) 10Marostegui: [C: 03+2] site.pp: Add two comments [puppet] - 10https://gerrit.wikimedia.org/r/565228 (owner: 10Marostegui) [08:39:08] !log Upgrade deploy*, snapshot* to php 7.2.26 - T241222 [08:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:11] T241222: Update Wikimedia production to PHP 7.2.26 - https://phabricator.wikimedia.org/T241222 [08:50:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1110', diff saved to https://phabricator.wikimedia.org/P10170 and previous config saved to /var/cache/conftool/dbconfig/20200116-085047-marostegui.json [08:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:26] (03PS2) 10Muehlenhoff: Pass down MAC address of to installing system via BOOTIF on Buster [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) [08:53:07] (03CR) 10Muehlenhoff: "Updated the patch following option 3 from" [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) (owner: 10Muehlenhoff) [08:53:44] 10Operations, 10Traffic: traffic_server crash upon Lua reload: attempt to concatenate a table value - https://phabricator.wikimedia.org/T242952 (10ema) [08:53:57] 10Operations, 10Traffic: traffic_server crash upon Lua reload: attempt to concatenate a table value - https://phabricator.wikimedia.org/T242952 (10ema) p:05Triage→03Normal [08:54:07] (03CR) 10Marostegui: [C: 03+1] Pass down MAC address of to installing system via BOOTIF on Buster [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) (owner: 10Muehlenhoff) [08:55:58] !log cp3063: ats-backend-restart to clear things up after traffic_server crash T242952 [08:56:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:02] T242952: traffic_server crash upon Lua reload: attempt to concatenate a table value - https://phabricator.wikimedia.org/T242952 [08:58:04] RECOVERY - traffic_server backend process restarted on cp3063 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=esams+prometheus/ops&var-instance=cp3063&var-layer=backend [08:59:33] (03PS8) 10Joal: Add mediawiki-history-dumps rsync to labstore [puppet] - 10https://gerrit.wikimedia.org/r/564066 [09:01:46] (03CR) 10jerkins-bot: [V: 04-1] Add mediawiki-history-dumps rsync to labstore [puppet] - 10https://gerrit.wikimedia.org/r/564066 (owner: 10Joal) [09:02:11] !log Updgrade cloudweb2001-dev.wikimedia.org,labweb[1001-1002].wikimedia.org to php 7.2.26 - T241222 [09:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:14] T241222: Update Wikimedia production to PHP 7.2.26 - https://phabricator.wikimedia.org/T241222 [09:02:40] 10Operations, 10Traffic: traffic_server crash upon Lua reload: attempt to concatenate a table value - https://phabricator.wikimedia.org/T242952 (10ema) [09:03:32] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: bump 'ops' retention to 4.5 months [puppet] - 10https://gerrit.wikimedia.org/r/564680 (owner: 10Filippo Giunchedi) [09:03:52] Krinkle: let me know if/when I can proceed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/564005/ [09:04:47] (03PS9) 10Joal: Add mediawiki-history-dumps rsync to labstore [puppet] - 10https://gerrit.wikimedia.org/r/564066 [09:07:05] there will be alerts re: prometheus restarting, expected [09:09:39] !log restart php-fpm on cloudweb2001-dev.wikimedia.org,labweb[1001-1002].wikimedia.org [09:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:25] actually I'll silence them [09:14:50] PROBLEM - Prometheus bast3004/ops 
restarted: beware possible monitoring artifacts on bast3004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [09:15:20] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:15:32] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:16:12] !log Updgrade mwmaint2001.codfw.wmnet,mwmaint1002.eqiad.wmnet,scandium.eqiad.wmnet, to php 7.2.26 - T241222 [09:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:15] T241222: Update Wikimedia production to PHP 7.2.26 - https://phabricator.wikimedia.org/T241222 [09:16:30] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:17:50] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:18:38] !log restart php-fpm on mwmaint2001.codfw.wmnet,mwmaint1002.eqiad.wmnet,scandium.eqiad.wmnet [09:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:40] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Don't we now have a cookbook for this? Last time I created VMs I used the cookbook (and it was awesome!). 
I 'd say move this to the cookbo" [puppet] - 10https://gerrit.wikimedia.org/r/565061 (https://phabricator.wikimedia.org/T242828) (owner: 10Volans) [09:18:52] 10Operations: ProdPasteBot uses deprecated certificate auth - https://phabricator.wikimedia.org/T242857 (10MoritzMuehlenhoff) p:05Triage→03Normal [09:19:14] 10Operations, 10Traffic: Docker registry needs cache to vary on Accept header value - https://phabricator.wikimedia.org/T242200 (10MoritzMuehlenhoff) p:05Triage→03Low [09:20:00] 10Operations, 10Puppet: Add check for changes applied at all runs - https://phabricator.wikimedia.org/T242910 (10MoritzMuehlenhoff) p:05Triage→03Normal [09:23:32] (03PS1) 10Alexandros Kosiaris: ganeti: Deprecate makevm.sh [puppet] - 10https://gerrit.wikimedia.org/r/565231 [09:23:34] (03PS1) 10Alexandros Kosiaris: ganeti: absent to the makevm script [puppet] - 10https://gerrit.wikimedia.org/r/565232 [09:23:36] (03PS1) 10Alexandros Kosiaris: ganeti: Remove makevm.sh [puppet] - 10https://gerrit.wikimedia.org/r/565233 [09:23:54] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Here we go https://gerrit.wikimedia.org/r/#/q/topic:remove_makevm+(status:open+OR+status:merged)" [puppet] - 10https://gerrit.wikimedia.org/r/565061 (https://phabricator.wikimedia.org/T242828) (owner: 10Volans) [09:24:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1100', diff saved to https://phabricator.wikimedia.org/P10171 and previous config saved to /var/cache/conftool/dbconfig/20200116-092409-root.json [09:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:21] (03PS1) 10Muehlenhoff: Fix copy&paste error [puppet] - 10https://gerrit.wikimedia.org/r/565235 [09:27:27] (03CR) 10Volans: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/565061 (https://phabricator.wikimedia.org/T242828) (owner: 10Volans) [09:27:55] (03CR) 10Volans: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/565231 (owner: 10Alexandros Kosiaris) [09:28:54] (03CR) 10Volans: [C: 04-1] "See inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/565233 (owner: 10Alexandros Kosiaris) [09:29:11] (03PS1) 10Ema: cache: enable systemd resources accounting on cp4027 [puppet] - 10https://gerrit.wikimedia.org/r/565236 (https://phabricator.wikimedia.org/T183146) [09:29:34] (03Abandoned) 10Volans: ganeti: add support to PoPs DCs [puppet] - 10https://gerrit.wikimedia.org/r/565061 (https://phabricator.wikimedia.org/T242828) (owner: 10Volans) [09:30:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:31:00] (03CR) 10Muehlenhoff: [C: 03+2] Fix copy&paste error [puppet] - 10https://gerrit.wikimedia.org/r/565235 (owner: 10Muehlenhoff) [09:33:37] (03CR) 10Ema: [C: 03+2] cache: enable systemd resources accounting on cp4027 [puppet] - 10https://gerrit.wikimedia.org/r/565236 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [09:34:27] moritzm: ok to puppet-merge your change along with mine? [09:34:44] I was about to ask the same, please do [09:35:10] the unknowns in icinga should recover soon btw [09:35:19] moritzm: ack, done [09:38:02] RECOVERY - Prometheus bast3004/ops restarted: beware possible monitoring artifacts on bast3004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [09:45:02] (03CR) 10Elukey: [C: 03+2] Add mediawiki-history-dumps rsync to labstore [puppet] - 10https://gerrit.wikimedia.org/r/564066 (owner: 10Joal) [09:45:16] Cc: apergos: --^ [09:47:01] okey dokey, bstorm_ should have the real heads up though [09:49:24] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:50:18] there is a puppet mistake, alarms expected, we are fixing it [09:50:19] :) [09:52:03] all right [09:53:48] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:55:30] (03PS1) 10Muehlenhoff: Install es2020 with stretch-installer-bootif TFTP environment [puppet] - 10https://gerrit.wikimedia.org/r/565237 (https://phabricator.wikimedia.org/T242481) [09:56:08] (03PS1) 10Joal: Correct labstore hdfs-rsync [puppet] - 10https://gerrit.wikimedia.org/r/565238 [09:57:18] (03CR) 10Marostegui: [C: 03+1] Install es2020 with stretch-installer-bootif TFTP environment [puppet] - 10https://gerrit.wikimedia.org/r/565237 (https://phabricator.wikimedia.org/T242481) (owner: 10Muehlenhoff) [09:58:31] 10Operations, 10observability, 10Patch-For-Review: Monitor resource usage on a per-cgroup basis - https://phabricator.wikimedia.org/T183146 (10ema) I have enabled cpu, memory, and blockio cgroups accounting on cp4026 `Jan 16 09:19:59` and cp4027 `Jan 16 09:37:00`. We can now observe if the change has any im... 
[10:00:28] (03CR) 10Elukey: [C: 03+2] Correct labstore hdfs-rsync [puppet] - 10https://gerrit.wikimedia.org/r/565238 (owner: 10Joal) [10:02:26] (03PS1) 10Vgutierrez: lvs: Add missing ncredir@ulsfo icinga configuration [puppet] - 10https://gerrit.wikimedia.org/r/565239 (https://phabricator.wikimedia.org/T242321) [10:03:08] (03PS2) 10Muehlenhoff: Install es2020 with stretch-installer-bootif TFTP environment [puppet] - 10https://gerrit.wikimedia.org/r/565237 (https://phabricator.wikimedia.org/T242481) [10:03:34] (03CR) 10Vgutierrez: [C: 03+2] lvs: Add missing ncredir@ulsfo icinga configuration [puppet] - 10https://gerrit.wikimedia.org/r/565239 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [10:07:35] (03CR) 10Muehlenhoff: [C: 03+2] Install es2020 with stretch-installer-bootif TFTP environment [puppet] - 10https://gerrit.wikimedia.org/r/565237 (https://phabricator.wikimedia.org/T242481) (owner: 10Muehlenhoff) [10:07:48] (03PS1) 10Vgutierrez: lvs: Add eqsin ncredir configuration [puppet] - 10https://gerrit.wikimedia.org/r/565240 (https://phabricator.wikimedia.org/T242321) [10:07:51] (03CR) 10Alexandros Kosiaris: ganeti: Remove makevm.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/565233 (owner: 10Alexandros Kosiaris) [10:08:23] (03PS2) 10Alexandros Kosiaris: ganeti: Deprecate makevm.sh [puppet] - 10https://gerrit.wikimedia.org/r/565231 [10:08:25] (03PS2) 10Alexandros Kosiaris: ganeti: absent to the makevm script [puppet] - 10https://gerrit.wikimedia.org/r/565232 [10:08:27] (03PS2) 10Alexandros Kosiaris: ganeti: Remove makevm.sh [puppet] - 10https://gerrit.wikimedia.org/r/565233 [10:09:12] (03CR) 10Muehlenhoff: [C: 03+1] ganeti: Remove makevm.sh [puppet] - 10https://gerrit.wikimedia.org/r/565233 (owner: 10Alexandros Kosiaris) [10:10:30] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 46968712 and 4 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:10:33] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/20382/" [puppet] - 10https://gerrit.wikimedia.org/r/565240 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [10:11:05] (03CR) 10Muehlenhoff: [C: 03+1] "Sounds good, I've used the cookbook many times successfully, it fits all needs" [puppet] - 10https://gerrit.wikimedia.org/r/565231 (owner: 10Alexandros Kosiaris) [10:11:20] (03CR) 10Muehlenhoff: [C: 03+1] ganeti: absent to the makevm script [puppet] - 10https://gerrit.wikimedia.org/r/565232 (owner: 10Alexandros Kosiaris) [10:11:27] moritzm: actually makevm doesn't work in esams for instance [10:11:46] so +1 to deprecating it [10:12:10] even more reason, yep [10:12:10] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 42440 and 83 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:12:12] all this is because I sent a patch to fix that :D [10:14:51] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: d-i fails to install on servers with BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es2020.codfw.wmnet'] ` The... 
[10:15:37] (03PS1) 10Ema: cache: remove cp1071, cp1099 hieradata [puppet] - 10https://gerrit.wikimedia.org/r/565241 (https://phabricator.wikimedia.org/T229586) [10:17:08] (03PS1) 10Joal: Correct labstore hdfs-rsync [puppet] - 10https://gerrit.wikimedia.org/r/565242 [10:17:46] (03CR) 10Elukey: [C: 03+2] Correct labstore hdfs-rsync [puppet] - 10https://gerrit.wikimedia.org/r/565242 (owner: 10Joal) [10:19:37] rsync between hdfs and labstore running, so far all good [10:20:39] excellent [10:21:35] volans: stop breaking things ;P [10:21:44] lol [10:23:05] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.29 [software/spicerack] - 10https://gerrit.wikimedia.org/r/565244 [10:23:55] vgutierrez: send a patch to fix volans! [10:23:56] (03PS1) 10Elukey: dumps::web::fetches::stats: add another $$ to the rsync command [puppet] - 10https://gerrit.wikimedia.org/r/565245 [10:24:33] (03CR) 10Elukey: [C: 03+2] dumps::web::fetches::stats: add another $$ to the rsync command [puppet] - 10https://gerrit.wikimedia.org/r/565245 (owner: 10Elukey) [10:24:43] 10Operations, 10OCG-General, 10Readers-Community-Engagement, 10Core Platform Team Legacy (Watching / External), and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871 (10TheDJ) 05Open→03Resolved I think this can be closed as OCG is... [10:24:44] here's a patch for you to fix things :D [10:26:49] (03CR) 10jerkins-bot: [V: 04-1] CHANGELOG: add changelogs for release v0.0.29 [software/spicerack] - 10https://gerrit.wikimedia.org/r/565244 (owner: 10Volans) [10:27:33] 10Operations, 10Collection, 10OfflineContentGenerator, 10Core Platform Team Legacy (Watching / External), and 2 others: Replace OCG in collection extension with Electron - https://phabricator.wikimedia.org/T150872 (10TheDJ) 05Open→03Resolved a:03TheDJ OCG was replaced, don't see a need to keep this o... 
[10:27:37] (03PS2) 10Volans: CHANGELOG: add changelogs for release v0.0.29 [software/spicerack] - 10https://gerrit.wikimedia.org/r/565244 [10:27:42] 10Operations, 10OCG-General, 10Readers-Community-Engagement, 10Core Platform Team Legacy (Watching / External), and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871 (10TheDJ) [10:33:04] (03CR) 10DCausse: [C: 04-1] A/B test for MLR models (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559614 (https://phabricator.wikimedia.org/T219534) (owner: 10Mstyles) [10:33:11] (03CR) 10Vgutierrez: [C: 03+1] CHANGELOG: add changelogs for release v0.0.29 [software/spicerack] - 10https://gerrit.wikimedia.org/r/565244 (owner: 10Volans) [10:33:34] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.29 [software/spicerack] - 10https://gerrit.wikimedia.org/r/565244 (owner: 10Volans) [10:35:09] (03CR) 10Volans: [C: 03+1] "Change looks sane to me. I'm not that familiar with this part of the code to be able to say 100% that there aren't any missing bits though" [puppet] - 10https://gerrit.wikimedia.org/r/565240 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [10:37:23] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.29 [software/spicerack] - 10https://gerrit.wikimedia.org/r/565244 (owner: 10Volans) [10:40:55] 10Operations, 10Collection, 10OfflineContentGenerator, 10Core Platform Team Legacy (Watching / External), and 2 others: Remove deprecated features from book creator UI - https://phabricator.wikimedia.org/T150917 (10TheDJ) 05Open→03Declined The deprecated features were not removed. I have no idea if tho... 
[10:40:58] 10Operations, 10Collection, 10OfflineContentGenerator, 10Core Platform Team Legacy (Watching / External), and 2 others: Replace OCG in collection extension with Electron - https://phabricator.wikimedia.org/T150872 (10TheDJ) [10:41:01] 10Operations, 10OCG-General, 10Readers-Community-Engagement, 10Core Platform Team Legacy (Watching / External), and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871 (10TheDJ) [10:43:30] (03PS2) 10Muehlenhoff: Switch url-downloader.eqiad to urldownloader1001 [dns] - 10https://gerrit.wikimedia.org/r/565033 (https://phabricator.wikimedia.org/T224551) [10:45:13] (03CR) 10Muehlenhoff: [C: 03+2] Switch url-downloader.eqiad to urldownloader1001 [dns] - 10https://gerrit.wikimedia.org/r/565033 (https://phabricator.wikimedia.org/T224551) (owner: 10Muehlenhoff) [10:45:36] (03PS1) 10Volans: Upstream release v0.0.29 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/565246 [10:56:30] PROBLEM - MariaDB Slave Lag: s8 on dbstore1005 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 588.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [10:57:44] 10Operations, 10OCG-General, 10Readers-Community-Engagement, 10Core Platform Team Legacy (Watching / External), and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871 (10phuedx) Thanks, @TheDJ! 
[10:59:00] RECOVERY - MariaDB Slave Lag: s8 on dbstore1005 is OK: OK slave_sql_lag Replication lag: 0.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [11:02:33] (03PS1) 10Muehlenhoff: Create separate role for repository servers [puppet] - 10https://gerrit.wikimedia.org/r/565249 (https://phabricator.wikimedia.org/T224576) [11:04:47] PROBLEM - Check systemd state on urldownloader1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:05] PROBLEM - Check systemd state on urldownloader2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:26] (03PS1) 10Muehlenhoff: profile::aptrepo::wikimedia: Use types for arguments [puppet] - 10https://gerrit.wikimedia.org/r/565251 [11:06:30] urldownloader1002/2002 is logrotate failing to rotate empty logs of (as these are fallback servers) [11:07:32] (03PS1) 10Giuseppe Lavagetto: Allow sending v2 requests to the registry [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/565252 [11:08:37] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.29 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/565246 (owner: 10Volans) [11:08:45] (03CR) 10Vgutierrez: [C: 03+2] lvs: Add eqsin ncredir configuration [puppet] - 10https://gerrit.wikimedia.org/r/565240 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [11:09:39] (03CR) 10Ema: "noop: https://puppet-compiler.wmflabs.org/compiler1002/20383/" [puppet] - 10https://gerrit.wikimedia.org/r/565241 (https://phabricator.wikimedia.org/T229586) (owner: 10Ema) [11:09:49] PROBLEM - Check systemd state on urldownloader1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:49] PROBLEM - Check systemd state on urldownloader2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:05] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes:weight=1; selector: service=nginx,name=ncredir5001.eqsin.wmnet [11:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:13] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes:weight=1; selector: service=nginx,name=ncredir5002.eqsin.wmnet [11:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:23] (03CR) 10Alexandros Kosiaris: [C: 03+2] cache: remove cp1071, cp1099 hieradata [puppet] - 10https://gerrit.wikimedia.org/r/565241 (https://phabricator.wikimedia.org/T229586) (owner: 10Ema) [11:12:42] (03PS1) 10Alexandros Kosiaris: Remove etcd100[456] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/565253 (https://phabricator.wikimedia.org/T239835) [11:12:48] (03Merged) 10jenkins-bot: Upstream release v0.0.29 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/565246 (owner: 10Volans) [11:13:08] !log restarting pybal on lvs5003 (secondary LVS) - T242321 [11:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:11] T242321: Provide non-canonical-redirect service from every datacenter - https://phabricator.wikimedia.org/T242321 [11:14:09] (03PS1) 10Elukey: aptrepo: fix bigtop14's repo settings [puppet] - 10https://gerrit.wikimedia.org/r/565254 [11:15:03] (03PS2) 10Alexandros Kosiaris: Remove etcd100[456] [puppet] - 10https://gerrit.wikimedia.org/r/565253 (https://phabricator.wikimedia.org/T239835) [11:15:26] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: d-i fails to install on servers with BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 
(10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es2020.codfw.wmnet'] ` Of which those **FAILED**: ` ['es2020.codfw.wmnet'] ` [11:15:46] (03CR) 10Elukey: [C: 03+2] aptrepo: fix bigtop14's repo settings [puppet] - 10https://gerrit.wikimedia.org/r/565254 (owner: 10Elukey) [11:16:21] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission [11:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:26] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [11:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:34] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission [11:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:17] !log restarting pybal on lvs5001 (high-traffic1) - T242321 [11:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:25] (03CR) 10Muehlenhoff: Remove etcd100[456] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/565253 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [11:18:10] !log uploaded spicerack_0.0.29-1_amd64.deb to apt.wikimedia.org stretch-wikimedia [11:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:34] (03PS3) 10Alexandros Kosiaris: Remove etcd100[456] [puppet] - 10https://gerrit.wikimedia.org/r/565253 (https://phabricator.wikimedia.org/T239835) [11:19:14] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Allow sending v2 requests to the registry [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/565252 (owner: 10Giuseppe Lavagetto) [11:20:20] (03CR) 10Muehlenhoff: [C: 03+1] Remove etcd100[456] [puppet] - 10https://gerrit.wikimedia.org/r/565253 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [11:20:24] (03Merged) 10jenkins-bot: Allow sending v2 requests to the registry [docker-images/docker-report] - 
10https://gerrit.wikimedia.org/r/565252 (owner: 10Giuseppe Lavagetto) [11:20:44] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [11:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:39] (03PS1) 10Giuseppe Lavagetto: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/565255 [11:21:51] (03PS1) 10Vgutierrez: Pool eqsin for ncredir service [dns] - 10https://gerrit.wikimedia.org/r/565256 (https://phabricator.wikimedia.org/T242321) [11:22:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/565255 (owner: 10Giuseppe Lavagetto) [11:22:20] 10Operations, 10Quality-and-Test-Engineering-Team (QTE), 10Wikimedia-Mailing-lists, 10User-zeljkofilipin: Close QA mailing list - https://phabricator.wikimedia.org/T237383 (10zeljkofilipin) @akosiaris please delete the list and keep the archives. [11:22:28] !log import packages in stretch-wikimedia's thirdparty/bigtop14 component [11:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:21] (03Merged) 10jenkins-bot: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/565255 (owner: 10Giuseppe Lavagetto) [11:24:16] 10Operations, 10Quality-and-Test-Engineering-Team (QTE), 10Wikimedia-Mailing-lists, 10User-zeljkofilipin: Close QA mailing list - https://phabricator.wikimedia.org/T237383 (10akosiaris) 05Open→03Resolved a:03akosiaris Done. The archives are still present at https://lists.wikimedia.org/pipermail/qa/.... 
[11:26:30] RECOVERY - Check systemd state on urldownloader1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:42] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove etcd100[456] [puppet] - 10https://gerrit.wikimedia.org/r/565253 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [11:26:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/565253 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [11:27:10] PROBLEM - MariaDB Slave Lag: s8 on dbstore1005 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 557.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [11:27:33] !log delete etcd100{4,5,6} from ganeti01.svc.eqiad.wmnet. T239835 [11:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:37] T239835: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 [11:27:41] !log delete etcd100{4,5,6} from netbox. T239835 [11:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:39] <_joe_> !log uploading docker-report 0.0.3 [11:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:44] 10Operations, 10Quality-and-Test-Engineering-Team (QTE), 10Wikimedia-Mailing-lists, 10User-zeljkofilipin: Close QA mailing list - https://phabricator.wikimedia.org/T237383 (10zeljkofilipin) Thanks! 
[11:29:56] RECOVERY - MariaDB Slave Lag: s8 on dbstore1005 is OK: OK slave_sql_lag Replication lag: 0.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [11:34:12] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:34:42] RECOVERY - Check systemd state on urldownloader2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:50] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:39:10] (03PS1) 10Arturo Borrero Gonzalez: kubernetes: ingress: introduce annotation to redirect the webapp root [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565259 (https://phabricator.wikimedia.org/T242719) [11:39:52] (03CR) 10jerkins-bot: [V: 04-1] kubernetes: ingress: introduce annotation to redirect the webapp root [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565259 (https://phabricator.wikimedia.org/T242719) (owner: 10Arturo Borrero Gonzalez) [11:54:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1080', diff saved to https://phabricator.wikimedia.org/P10172 and previous config saved to /var/cache/conftool/dbconfig/20200116-115420-marostegui.json [11:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:32] (03PS1) 10Mvolz: Update citoid to 30f793422 [deployment-charts] - 10https://gerrit.wikimedia.org/r/565261 [11:55:58] (03PS2) 10Mvolz: Update citoid to 30f793422 [deployment-charts] - 10https://gerrit.wikimedia.org/r/565261 [11:59:02] (03PS3) 10Mvolz: Update citoid to 30f793422 (staging cluster only) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/565261 [11:59:40] <_joe_> !log delete mediawiki-core images from october 2019 T242775 [11:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:43] T242775: Hundreds of tags for `wikimedia/mediawiki-core` image - https://phabricator.wikimedia.org/T242775 [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200116T1200). [12:00:04] Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:01:23] o/ [12:01:23] (03PS1) 10Mvolz: Update zotero to 5953b26 (staging only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/565262 [12:02:22] I guess I deploy mines [12:04:11] Amir1: ping me when you're done please [12:04:47] sure [12:05:07] (03PS2) 10Ladsgroup: Set read for items in Wikidata for new term store up to Q8M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565074 (https://phabricator.wikimedia.org/T225057) [12:05:12] (03CR) 10Ladsgroup: [C: 03+2] Set read for items in Wikidata for new term store up to Q8M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565074 (https://phabricator.wikimedia.org/T225057) (owner: 10Ladsgroup) [12:05:47] thanks [12:06:33] (03Merged) 10jenkins-bot: Set read for items in Wikidata for new term store up to Q8M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565074 (https://phabricator.wikimedia.org/T225057) (owner: 10Ladsgroup) [12:08:42] marostegui: Some changes are going to affect s8, increase rows read. 
It will improve later today once I backport a fix [12:09:56] ok, thanks for the heads up [12:10:00] (03PS2) 10Arturo Borrero Gonzalez: kubernetes: ingress: introduce annotation to redirect the webapp root [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565259 (https://phabricator.wikimedia.org/T242719) [12:10:12] (03PS2) 10Ladsgroup: Stop writing to wb_terms for properties in Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565073 (https://phabricator.wikimedia.org/T225054) [12:10:31] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [12:10:36] (03PS16) 10Giuseppe Lavagetto: wmflib: Introduce a more usable data structure to describe services. [puppet] - 10https://gerrit.wikimedia.org/r/558620 [12:10:38] (03PS13) 10Giuseppe Lavagetto: lvs::monitor: fix most technical debt [puppet] - 10https://gerrit.wikimedia.org/r/564690 [12:10:57] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:565074|Set read for items in Wikidata for new term store up to Q8M (T225057)]] (duration: 01m 07s) [12:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:01] T225057: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225057 [12:11:24] (03CR) 10Ladsgroup: [C: 03+2] Stop writing to wb_terms for properties in Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565073 (https://phabricator.wikimedia.org/T225054) (owner: 10Ladsgroup) [12:11:32] jouncebot: next [12:11:32] In 0 hour(s) and 48 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200116T1300) [12:12:22] (03Merged) 10jenkins-bot: Stop writing to wb_terms for properties in Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565073 (https://phabricator.wikimedia.org/T225054) (owner: 10Ladsgroup) [12:14:37] !log 
installing OpenSSL security updates [12:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:48] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:565073|Stop writing to wb_terms for properties in Test Wikidata (T225054)]] (duration: 01m 04s) [12:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:51] T225054: Switch `tmpPropertyTermsMigrationStage` to MIGRATION_NEW - https://phabricator.wikimedia.org/T225054 [12:16:47] !log Upgrade jobrunners to php 7.2.26 and restart - T241222 [12:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:50] T241222: Update Wikimedia production to PHP 7.2.26 - https://phabricator.wikimedia.org/T241222 [12:17:22] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: Another sync for the IS.php cache issue (duration: 01m 04s) [12:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:09] 10Operations: Anycast for webproxies - https://phabricator.wikimedia.org/T242715 (10faidon) I think increasing the availability and resilience of this service is an excellent idea! However, adding more servers to per site feels like a requirement, and a standard Pybal/IPVS setup sounds much more appropriate than... 
[12:19:32] !log "delete from testwikidatawiki.wb_terms where term_full_entity_id like 'P%'" (T219301 T225054) [12:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:36] T219301: Migrate to and read from new store for property terms - https://phabricator.wikimedia.org/T219301 [12:20:23] (I'm doing it in batches to avoid replag on s3) [12:21:40] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: d-i fails to install on servers with BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es2020.codfw.wmnet'] ` The... [12:21:50] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: d-i fails to install on servers with BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es2020.codfw.wmnet'] ` Of which those **FAILED**: ` ['es2020.codfw.wmnet'] ` [12:22:04] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: d-i fails to install on servers with BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es2020.codfw.wmnet'] ` The... 
[12:23:05] (03CR) 10Giuseppe Lavagetto: "The change is an effective noop on the load balancers and on a server with lvs::realserver:" [puppet] - 10https://gerrit.wikimedia.org/r/558620 (owner: 10Giuseppe Lavagetto) [12:23:29] !log restart php-fpm on labweb* [12:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:55] (03PS3) 10Arturo Borrero Gonzalez: kubernetes: ingress: introduce annotation to redirect the webapp root [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565259 (https://phabricator.wikimedia.org/T242719) [12:24:37] (03CR) 10Vgutierrez: [C: 03+1] "LGTM, thanks for taking care of this <3" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/558620 (owner: 10Giuseppe Lavagetto) [12:26:31] Urbanecm: the floor is yours [12:26:40] thank you Amir1 [12:27:16] (03PS2) 10Urbanecm: Configure GlobalRename blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564102 (https://phabricator.wikimedia.org/T101615) [12:27:20] (03CR) 10Urbanecm: [C: 03+2] Configure GlobalRename blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564102 (https://phabricator.wikimedia.org/T101615) (owner: 10Urbanecm) [12:28:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1080', diff saved to https://phabricator.wikimedia.org/P10174 and previous config saved to /var/cache/conftool/dbconfig/20200116-122806-marostegui.json [12:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:26] (03Merged) 10jenkins-bot: Configure GlobalRename blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564102 (https://phabricator.wikimedia.org/T101615) (owner: 10Urbanecm) [12:28:44] (03PS4) 10Arturo Borrero Gonzalez: kubernetes: ingress: introduce annotation to redirect the webapp root [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565259 (https://phabricator.wikimedia.org/T242719) [12:28:58] Amir1: outstanding commit in /srv/mediawiki-staging 
[12:29:10] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1001/20385/ no significant changes on the puppetmasters or the icinga hosts." [puppet] - 10https://gerrit.wikimedia.org/r/558620 (owner: 10Giuseppe Lavagetto) [12:29:21] Urbanecm: shoot, what is it? [12:29:29] (03CR) 10jerkins-bot: [V: 04-1] kubernetes: ingress: introduce annotation to redirect the webapp root [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565259 (https://phabricator.wikimedia.org/T242719) (owner: 10Arturo Borrero Gonzalez) [12:29:33] Amir1: here you are https://www.irccloud.com/pastebin/CNKMRo88/ [12:29:41] yup, sorry [12:29:45] let me do it again [12:29:59] Amir1: sure. I've already merged a patch I want to deploy, just FYI [12:30:16] I sometimes forget git rebase [12:30:22] is it IS.php? [12:30:42] my patch? Yes [12:30:55] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:565073|Stop writing to wb_terms for properties in Test Wikidata (T225054)]] (duration: 01m 05s) [12:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:58] T225054: Switch `tmpPropertyTermsMigrationStage` to MIGRATION_NEW - https://phabricator.wikimedia.org/T225054 [12:31:12] Urbanecm: done [12:31:25] (03PS5) 10Arturo Borrero Gonzalez: kubernetes: ingress: introduce annotation to redirect the webapp root [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565259 (https://phabricator.wikimedia.org/T242719) [12:31:37] thanks Amir1 [12:31:56] Thanks for catching it, sorry [12:32:56] Np, happens to everyone ig [12:33:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 65e17eb: Configure GlobalRename blacklist (T101615) (duration: 01m 05s) [12:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:43] T101615: Create a blacklist of user who can not use Special:GlobalRenameRequest - https://phabricator.wikimedia.org/T101615 [12:37:03] 
(03PS1) 10Marostegui: install_server: Add stretch-installer-bootif to es20XX hosts [puppet] - 10https://gerrit.wikimedia.org/r/565265 (https://phabricator.wikimedia.org/T242481) [12:37:35] jouncebot: next [12:37:35] In 0 hour(s) and 22 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200116T1300) [12:38:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [12:38:09] (03PS2) 10Urbanecm: Add `Tutoriel` namespace for French Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/563418 (https://phabricator.wikimedia.org/T242102) (owner: 10Ammarpad) [12:38:15] (03CR) 10Urbanecm: [C: 03+2] Add `Tutoriel` namespace for French Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/563418 (https://phabricator.wikimedia.org/T242102) (owner: 10Ammarpad) [12:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "after some back and forth with the syntax and the linter, this should be ready for review." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565259 (https://phabricator.wikimedia.org/T242719) (owner: 10Arturo Borrero Gonzalez) [12:38:19] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/565265 (https://phabricator.wikimedia.org/T242481) (owner: 10Marostegui) [12:38:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1080', diff saved to https://phabricator.wikimedia.org/P10175 and previous config saved to /var/cache/conftool/dbconfig/20200116-123841-marostegui.json [12:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:46] jouncebot: now [12:38:46] For the next 0 hour(s) and 21 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200116T1200) [12:38:52] (03CR) 10Marostegui: [C: 03+2] install_server: Add stretch-installer-bootif to es20XX hosts [puppet] - 10https://gerrit.wikimedia.org/r/565265 (https://phabricator.wikimedia.org/T242481) (owner: 10Marostegui) [12:39:18] (03Merged) 10jenkins-bot: Add `Tutoriel` namespace for French Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/563418 (https://phabricator.wikimedia.org/T242102) (owner: 10Ammarpad) [12:39:59] 10Operations, 10ops-codfw, 10DBA: (Needed By 31st January) codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (10Marostegui) @Papaul you can proceed with the installation of es2021, es2022, es2023, es2024 and es2025 es2020 is installed already Thanks! 
[12:40:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:00] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 7381446: Add `Tutoriel` namespace for French Wiktionary (T242102) (duration: 01m 04s) [12:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:03] T242102: New namespace for French Wiktionary - https://phabricator.wikimedia.org/T242102 [12:44:28] (03CR) 10Urbanecm: [C: 03+2] Upload HD Logo for fawikivoyage, jawikiquote and cywikiquote. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560380 (https://phabricator.wikimedia.org/T150618) (owner: 10Subscriptshoe9) [12:44:30] (03CR) 10Urbanecm: [C: 03+2] Upload HD logos for hi, la and no wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559344 (https://phabricator.wikimedia.org/T150618) (owner: 10Minhducsun2002) [12:45:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1080', diff saved to https://phabricator.wikimedia.org/P10176 and previous config saved to /var/cache/conftool/dbconfig/20200116-124516-marostegui.json [12:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:20] (03Merged) 10jenkins-bot: Upload HD Logo for fawikivoyage, jawikiquote and cywikiquote. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/560380 (https://phabricator.wikimedia.org/T150618) (owner: 10Subscriptshoe9) [12:45:23] (03Merged) 10jenkins-bot: Upload HD logos for hi, la and no wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559344 (https://phabricator.wikimedia.org/T150618) (owner: 10Minhducsun2002) [12:46:04] (03PS7) 10Urbanecm: Add HD logos to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560580 (https://phabricator.wikimedia.org/T150618) (owner: 10Subscriptshoe9) [12:46:12] (03CR) 10Urbanecm: [C: 03+2] Add HD logos to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560580 (https://phabricator.wikimedia.org/T150618) (owner: 10Subscriptshoe9) [12:46:33] (03PS4) 10Urbanecm: Add wgLogoHD entry for fa, te wikiquote & fr wikisource in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560556 (owner: 10Minhducsun2002) [12:46:43] (03CR) 10Urbanecm: [C: 03+2] Add wgLogoHD entry for fa, te wikiquote & fr wikisource in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560556 (owner: 10Minhducsun2002) [12:46:55] (03CR) 10jerkins-bot: [V: 04-1] Add HD logos to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560580 (https://phabricator.wikimedia.org/T150618) (owner: 10Subscriptshoe9) [12:47:04] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: d-i fails to install on servers with BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es2020.codfw.wmnet'] ` and were **ALL** successful. [12:47:30] (03CR) 10jerkins-bot: [V: 04-1] Add wgLogoHD entry for fa, te wikiquote & fr wikisource in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560556 (owner: 10Minhducsun2002) [12:47:55] 10Operations, 10serviceops, 10Patch-For-Review: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (10jijiki) Yeah, we need at least a total of 4 api and 4 app canary servers in codfw. 
In eqiad our canary app (5) and api (4) servers are in the same rack actually, we can spread them a b... [12:48:07] (03CR) 10Urbanecm: [C: 03+2] Upload HD Logo for 9 Wikibooks Projects and 1 Wikipeida Project: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560577 (https://phabricator.wikimedia.org/T150618) (owner: 10Subscriptshoe9) [12:49:01] (03Merged) 10jenkins-bot: Upload HD Logo for 9 Wikibooks Projects and 1 Wikipeida Project: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560577 (https://phabricator.wikimedia.org/T150618) (owner: 10Subscriptshoe9) [12:49:41] (03CR) 10Urbanecm: [C: 03+2] Upload HD logos for fa, te wikiquote & fr wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560555 (owner: 10Minhducsun2002) [12:50:29] (03Merged) 10jenkins-bot: Upload HD logos for fa, te wikiquote & fr wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560555 (owner: 10Minhducsun2002) [12:50:48] (03PS8) 10Urbanecm: Add HD logos to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560580 (https://phabricator.wikimedia.org/T150618) (owner: 10Subscriptshoe9) [12:51:11] (03PS5) 10Urbanecm: Add wgLogoHD entry for fa, te wikiquote & fr wikisource in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560556 (owner: 10Minhducsun2002) [12:51:44] (03CR) 10Effie Mouzeli: [C: 03+1] define 2 API appservers per row in codfw as canary API appservers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/564175 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [12:51:49] !log remove BGP sessions to AS22652 in eqiad (left the IX) [12:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:50] (03CR) 10Minhducsun2002: "Seems like it depends on Ia26a2ebeeb3ed826827b45f816b41db7ab4d249c." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/560556 (owner: 10Minhducsun2002) [12:54:59] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: SWAT: Sync project logos (duration: 01m 06s) [12:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:54] (03CR) 10Urbanecm: [C: 03+2] Add wgLogoHD entry for fa, te wikiquote & fr wikisource in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560556 (owner: 10Minhducsun2002) [12:56:01] (03CR) 10Urbanecm: [C: 03+2] Add HD logos to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560580 (https://phabricator.wikimedia.org/T150618) (owner: 10Subscriptshoe9) [12:56:55] (03Merged) 10jenkins-bot: Add wgLogoHD entry for fa, te wikiquote & fr wikisource in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560556 (owner: 10Minhducsun2002) [12:57:40] (03PS9) 10Urbanecm: Add HD logos to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560580 (https://phabricator.wikimedia.org/T150618) (owner: 10Subscriptshoe9) [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200116T1300) [13:00:14] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 940b9a2: Add wgLogoHD entry for fa, te wikiquote & fr wikisource in IS.php (duration: 01m 05s) [13:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:24] * Urbanecm is going to steal a little from sanity window unfortunately [13:02:01] 10Operations, 10serviceops, 10Patch-For-Review: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (10MoritzMuehlenhoff) Agreed, I think for our uses of the canaries, rack redundancy is not a must, but would still be nice to have when re-adding canaries to codfw. 
[13:02:33] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: aedd2c4: Add HD logos to IS.php (duration: 01m 04s) [13:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:33] (03CR) 10Ema: [C: 03+1] Pool eqsin for ncredir service [dns] - 10https://gerrit.wikimedia.org/r/565256 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [13:09:25] !log EU SWAT done late [13:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:37] 10Operations, 10netops, 10Patch-For-Review: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10ayounsi) 05Open→03Resolved We now reject **all** RPKI invalid, from peering and transits, without any default route. So far everything looks good. Blogpost to follow in the next few days. [13:17:29] PROBLEM - grafana.wikimedia.org on grafana1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [13:17:37] PROBLEM - Check systemd state on grafana1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:16] ^ that was me, looking what went wrong [13:24:01] RECOVERY - Check systemd state on grafana1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:04] (03PS1) 10Urbanecm: Fix mistakes in HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565268 (https://phabricator.wikimedia.org/T150618) [13:29:07] (03PS1) 10Urbanecm: Add logos to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565269 (https://phabricator.wikimedia.org/T150618) [13:29:51] 10Operations, 10ops-eqiad, 10DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) You guys think this will be ready by 31st Jan? Thanks. [13:30:36] (03CR) 10jerkins-bot: [V: 04-1] Add logos to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565269 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [13:30:45] !log restarting Swift frontends to pick up OpenSSL security update [13:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:12] (03PS2) 10Urbanecm: Add logos to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565269 (https://phabricator.wikimedia.org/T150618) [13:35:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1097:3314 db1097:3315', diff saved to https://phabricator.wikimedia.org/P10178 and previous config saved to /var/cache/conftool/dbconfig/20200116-133515-marostegui.json [13:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:11] !log Upgrade db1097:3314 db1097:3315 [13:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:01] (03PS1) 10Muehlenhoff: Remove actinium|alcyone|alsafi|aluminium [puppet] - 10https://gerrit.wikimedia.org/r/565271 (https://phabricator.wikimedia.org/T224551) [13:40:26] 
10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [13:40:27] PROBLEM - Check systemd state on grafana1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1097:3314 db1097:3315', diff saved to https://phabricator.wikimedia.org/P10179 and previous config saved to /var/cache/conftool/dbconfig/20200116-134801-marostegui.json [13:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:51] RECOVERY - Check systemd state on grafana1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:55] (03PS3) 10Gehel: increase spacing between osm replication [puppet] - 10https://gerrit.wikimedia.org/r/565013 (owner: 10MSantos) [13:57:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1097:3314 db1097:3315', diff saved to https://phabricator.wikimedia.org/P10180 and previous config saved to /var/cache/conftool/dbconfig/20200116-135659-marostegui.json [13:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:10] (03CR) 10Gehel: [C: 03+2] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1003/20388/maps1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/565013 (owner: 10MSantos) [14:00:04] liw and brennen: That opportune time is upon us again. Time for a Mediawiki train - European+American Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200116T1400). 
[14:00:41] (03PS1) 10Lars Wirzenius: all wikis to 1.35.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565275 [14:00:43] (03CR) 10Lars Wirzenius: [C: 03+2] all wikis to 1.35.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565275 (owner: 10Lars Wirzenius) [14:01:46] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565275 (owner: 10Lars Wirzenius) [14:04:16] !log liw@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.15 [14:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:00] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: d-i fails to install on servers with BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10MoritzMuehlenhoff) So, this will be addressed in two parts: * We'll enable "ipappend 2" for Buster: https://gerrit.wikimedia.org/r/#/c/o... [14:05:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1097:3314 db1097:3315', diff saved to https://phabricator.wikimedia.org/P10181 and previous config saved to /var/cache/conftool/dbconfig/20200116-140501-marostegui.json [14:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:55] (03CR) 10Andrew Bogott: "gerrit's weird diff behavior is making this change look much bigger than it is. I'll see if I can work around that." 
[puppet] - 10https://gerrit.wikimedia.org/r/565043 (https://phabricator.wikimedia.org/T238766) (owner: 10Andrew Bogott) [14:11:54] (03PS1) 10Elukey: Deploy Apache BigTop's apt repository on analytics1031 [puppet] - 10https://gerrit.wikimedia.org/r/565279 [14:18:13] (03PS1) 10Elukey: admin: add isaacj to gpu-testers [puppet] - 10https://gerrit.wikimedia.org/r/565280 [14:18:37] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to EventLogging data for knissen - https://phabricator.wikimedia.org/T241838 (10Nuria) @kai.nissen Please read data access gudelines, https://wikitech.wikimedia.org/wiki/Analytics/Data_Access_Guidelines and familirize yourself with ha... [14:19:35] (03CR) 10Nuria: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/565280 (owner: 10Elukey) [14:20:26] (03CR) 10Elukey: [C: 03+2] admin: add isaacj to gpu-testers [puppet] - 10https://gerrit.wikimedia.org/r/565280 (owner: 10Elukey) [14:23:50] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/20389/" [puppet] - 10https://gerrit.wikimedia.org/r/565279 (owner: 10Elukey) [14:24:30] (03CR) 10Elukey: [C: 03+2] Deploy Apache BigTop's apt repository on analytics1031 [puppet] - 10https://gerrit.wikimedia.org/r/565279 (owner: 10Elukey) [14:28:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1097:3314 db1097:3315', diff saved to https://phabricator.wikimedia.org/P10182 and previous config saved to /var/cache/conftool/dbconfig/20200116-142800-marostegui.json [14:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:07] (03PS1) 10Filippo Giunchedi: Add restbase2[123] to restbase::production [puppet] - 10https://gerrit.wikimedia.org/r/565281 (https://phabricator.wikimedia.org/T241790) [14:29:21] (03PS1) 10Muehlenhoff: Remove DNS records for actinium|alcyone|alsafi|aluminium [dns] - 10https://gerrit.wikimedia.org/r/565282 (https://phabricator.wikimedia.org/T224551) [14:31:14] (03PS1) 10Joal: Add 
mediacount fetch using hdfs-rsync on labstore [puppet] - 10https://gerrit.wikimedia.org/r/565283 [14:31:30] elukey: --^ patch for mediacount :) [14:33:27] (03CR) 10Vgutierrez: [C: 03+2] Pool eqsin for ncredir service [dns] - 10https://gerrit.wikimedia.org/r/565256 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [14:33:54] (03PS3) 10Andrew Bogott: wmcs-dns-floating-ip-updater.py: retry if we encounter an exception [puppet] - 10https://gerrit.wikimedia.org/r/565044 (https://phabricator.wikimedia.org/T238766) [14:33:56] (03PS1) 10Andrew Bogott: wmcs-dns-floating-ip-updater.py: Partial refactor [puppet] - 10https://gerrit.wikimedia.org/r/565284 (https://phabricator.wikimedia.org/T238766) [14:33:58] (03PS1) 10Andrew Bogott: wmcs-dns-floating-ip-updater.py: further refactor [puppet] - 10https://gerrit.wikimedia.org/r/565285 (https://phabricator.wikimedia.org/T238766) [14:34:00] (03PS1) 10Andrew Bogott: wmcs-dns-floating-ip-updater.py: add a main() function [puppet] - 10https://gerrit.wikimedia.org/r/565286 [14:34:02] (03PS1) 10Andrew Bogott: wmcs-dns-floating-ip-updater.py: catch all exceptions [puppet] - 10https://gerrit.wikimedia.org/r/565287 (https://phabricator.wikimedia.org/T238766) [14:34:32] (03PS1) 10Muehlenhoff: profile::url_downloader: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/565288 (https://phabricator.wikimedia.org/T224551) [14:35:34] (03Abandoned) 10Andrew Bogott: wmcs-dns-floating-ip-updater.py: modest refactor [puppet] - 10https://gerrit.wikimedia.org/r/565043 (https://phabricator.wikimedia.org/T238766) (owner: 10Andrew Bogott) [14:35:37] 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10aborrero) a:05aborrero→03None [14:36:02] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10aborrero) 
a:05aborrero→03None [14:36:11] (03CR) 10jerkins-bot: [V: 04-1] wmcs-dns-floating-ip-updater.py: add a main() function [puppet] - 10https://gerrit.wikimedia.org/r/565286 (owner: 10Andrew Bogott) [14:38:17] (03CR) 10Andrew Bogott: wmcs-dns-floating-ip-updater.py: Partial refactor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/565284 (https://phabricator.wikimedia.org/T238766) (owner: 10Andrew Bogott) [14:39:03] (03PS2) 10Muehlenhoff: admins: add Kai Nissen to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/562940 (https://phabricator.wikimedia.org/T241838) (owner: 10Dzahn) [14:41:03] (03PS17) 10Giuseppe Lavagetto: wmflib: Introduce a more usable data structure to describe services. [puppet] - 10https://gerrit.wikimedia.org/r/558620 [14:41:05] (03PS14) 10Giuseppe Lavagetto: lvs::monitor: fix most technical debt [puppet] - 10https://gerrit.wikimedia.org/r/564690 [14:41:07] (03PS1) 10Giuseppe Lavagetto: lvs::monitor: use unique identifiers for services [puppet] - 10https://gerrit.wikimedia.org/r/565290 [14:42:50] (03CR) 10Muehlenhoff: [C: 03+2] admins: add Kai Nissen to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/562940 (https://phabricator.wikimedia.org/T241838) (owner: 10Dzahn) [14:43:38] (03CR) 10jerkins-bot: [V: 04-1] lvs::monitor: use unique identifiers for services [puppet] - 10https://gerrit.wikimedia.org/r/565290 (owner: 10Giuseppe Lavagetto) [14:43:46] (03CR) 10Andrew Bogott: "promising" [puppet] - 10https://gerrit.wikimedia.org/r/564662 (owner: 10Andrew Bogott) [14:44:41] (03CR) 10Andrew Bogott: "eventually... 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/20392/console" [puppet] - 10https://gerrit.wikimedia.org/r/564662 (owner: 10Andrew Bogott) [14:47:40] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to EventLogging data for knissen - https://phabricator.wikimedia.org/T241838 (10MoritzMuehlenhoff) 05Open→03Resolved @kai.nissen Your access is now enabled (but it can take up to 30 minutes until it has propagated to all servers),... [14:47:50] (03PS2) 10Giuseppe Lavagetto: lvs::monitor: use unique identifiers for services [puppet] - 10https://gerrit.wikimedia.org/r/565290 [14:49:57] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/20393/labstore1006.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/565283 (owner: 10Joal) [14:52:02] (03CR) 10Andrew Bogott: "also promising: https://puppet-compiler.wmflabs.org/compiler1003/20396/" [puppet] - 10https://gerrit.wikimedia.org/r/564662 (owner: 10Andrew Bogott) [14:55:55] (03CR) 10Giuseppe Lavagetto: "Puppet compiler diffs are all over the place of course, but I analyzed the outcomes from:" [puppet] - 10https://gerrit.wikimedia.org/r/565290 (owner: 10Giuseppe Lavagetto) [14:57:34] (03CR) 10BBlack: [C: 03+1] ATS: Disable TLSv1.0/1.1 support on the caching layer [puppet] - 10https://gerrit.wikimedia.org/r/562779 (https://phabricator.wikimedia.org/T238038) (owner: 10Vgutierrez) [14:59:16] \o/ [15:02:13] (03CR) 10Vgutierrez: [C: 03+2] ATS: Disable TLSv1.0/1.1 support on the caching layer [puppet] - 10https://gerrit.wikimedia.org/r/562779 (https://phabricator.wikimedia.org/T238038) (owner: 10Vgutierrez) [15:04:48] !log rolling restart of ats-tls. 
This effectively disables TLSv1/1.1 across the caching cluster - T238038 [15:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:51] T238038: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 [15:05:09] (03CR) 10Ppchelko: [C: 03+1] "Feel free to self-merge when you're ready for deloyment." [deployment-charts] - 10https://gerrit.wikimedia.org/r/565261 (owner: 10Mvolz) [15:05:42] vgutierrez: you monster [15:06:01] it's always a pleasure getting feedback :D [15:06:46] heh [15:08:33] vgutierrez es un crack :P [15:08:58] :flush: [15:09:02] ahaha [15:09:06] 🎉 [15:09:33] vgutierrez: wow niceeee!! [15:09:56] really great result! [15:13:36] 10Operations, 10Traffic: Provide non-canonical-redirect service from every datacenter - https://phabricator.wikimedia.org/T242321 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [15:13:38] 10Operations, 10SRE-Access-Requests: Requesting access to EventLogging data for knissen - https://phabricator.wikimedia.org/T241838 (10MoritzMuehlenhoff) @kai.nissen You should have also received a mail for your Kerberos account (required to access Hadoop) with further instructions. 
[15:15:33] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@16a1644]: Upgrade to superset 0.35.2 [15:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:12] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@16a1644]: Upgrade to superset 0.35.2 (duration: 00m 40s) [15:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:51] (03PS1) 10ArielGlenn: for 7z production in batches, skip files that exist at beginning of each batch [dumps] - 10https://gerrit.wikimedia.org/r/565301 [15:26:07] (03PS1) 10Muehlenhoff: Add Jennifer Wang to analytics-privatedata-users and researchers [puppet] - 10https://gerrit.wikimedia.org/r/565304 (https://phabricator.wikimedia.org/T242807) [15:32:15] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frlog2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242265 (10Jgreen) a:05Jgreen→03Papaul @Papaul I'm getting an error at pxeboot, looks like the cable is not connected or the network port is off? Can you take a look? Boo... [15:33:17] (03CR) 10Filippo Giunchedi: [C: 03+2] Add restbase2[123] to restbase::production [puppet] - 10https://gerrit.wikimedia.org/r/565281 (https://phabricator.wikimedia.org/T241790) (owner: 10Filippo Giunchedi) [15:35:55] (03CR) 10Ppchelko: [C: 03+1] "Feel free to self merge when ready to deploy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/565262 (owner: 10Mvolz) [15:38:10] 10Operations, 10SRE-Access-Requests: Requesting access to EventLogging data for knissen - https://phabricator.wikimedia.org/T241838 (10kai.nissen) @MoritzMuehlenhoff Login and querying works, thanks a lot! 
[15:40:56] (03PS1) 10Filippo Giunchedi: Add cassandra instances for restbase202[123] [dns] - 10https://gerrit.wikimedia.org/r/565305 (https://phabricator.wikimedia.org/T241790) [15:43:04] (03CR) 10Filippo Giunchedi: [C: 03+2] Add cassandra instances for restbase202[123] [dns] - 10https://gerrit.wikimedia.org/r/565305 (https://phabricator.wikimedia.org/T241790) (owner: 10Filippo Giunchedi) [15:44:39] (03PS1) 10Filippo Giunchedi: hieradata: add restbase202[123] instances [puppet] - 10https://gerrit.wikimedia.org/r/565307 (https://phabricator.wikimedia.org/T241790) [15:49:16] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add restbase202[123] instances [puppet] - 10https://gerrit.wikimedia.org/r/565307 (https://phabricator.wikimedia.org/T241790) (owner: 10Filippo Giunchedi) [15:49:27] PROBLEM - Check systemd state on grafana1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:10] (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/565307 (https://phabricator.wikimedia.org/T241790) (owner: 10Filippo Giunchedi) [15:51:10] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users, researchers & wmf for jennifer wang (jwang) - https://phabricator.wikimedia.org/T242807 (10MoritzMuehlenhoff) @jwang : I already enabled your LDAP access via the "wmf" group, the services listed at http... 
[15:57:23] jouncebot: now [15:57:23] No deployments scheduled for the next 1 hour(s) and 2 minute(s) [15:57:28] jouncebot: next [15:57:28] In 1 hour(s) and 2 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200116T1700) [15:59:10] !log Upgrade appservers and api to php 7.2.26 and restart - T241222 [15:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:13] T241222: Update Wikimedia production to PHP 7.2.26 - https://phabricator.wikimedia.org/T241222 [15:59:24] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:59:26] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:45] RECOVERY - Check systemd state on urldownloader2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:09] RECOVERY - Check systemd state on urldownloader1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:23] RECOVERY - Check systemd state on grafana1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:56] (03CR) 10Vgutierrez: [C: 03+2] ssl_ciphersuite: Allow TLSv1/TLSv1.1 in compat mode only [puppet] - 10https://gerrit.wikimedia.org/r/551396 (https://phabricator.wikimedia.org/T238518) (owner: 10Vgutierrez) [16:08:11] (03PS1) 10Filippo Giunchedi: puppetmaster: install cassandra-ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/565312 [16:08:34] 10Operations, 10DC-Ops, 10decommission: decommission frdb2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242983 (10Jgreen) [16:08:42] 10Operations, 10Cloud-Services, 10DBA: m5-master overloaded by idle connections
to the nova database - https://phabricator.wikimedia.org/T188589 (10Bstorm) [16:14:06] (03CR) 10Bstorm: "We should be able to live hack this into any tool on the new cluster. Bryan, you got any to try? I might try moving cdnjs today if I fin" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565259 (https://phabricator.wikimedia.org/T242719) (owner: 10Arturo Borrero Gonzalez) [16:14:26] (03CR) 10Tarrow: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/563410 (owner: 10Tarrow) [16:14:27] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20681880 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:16:23] (03PS4) 10Jakob: Termbox chart 0.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/563410 (owner: 10Tarrow) [16:16:45] (03CR) 10Bstorm: "> Patch Set 5:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565259 (https://phabricator.wikimedia.org/T242719) (owner: 10Arturo Borrero Gonzalez) [16:17:19] 10Operations, 10DC-Ops, 10decommission: decommission alnitak.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242990 (10Jgreen) [16:17:34] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 23568 and 75 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:18:01] (03PS1) 10Jgreen: remove DNS records for frdb2001 and alnitak [dns] - 10https://gerrit.wikimedia.org/r/565313 (https://phabricator.wikimedia.org/T242983) [16:18:54] (03CR) 10Jakob: [C: 03+2] Termbox chart 0.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/563410 (owner: 10Tarrow) [16:19:17] (03Merged) 10jenkins-bot: Termbox chart 0.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/563410 (owner: 10Tarrow) [16:19:18] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission frdb2001.frack.codfw.wmnet - 
https://phabricator.wikimedia.org/T242983 (10Jgreen) [16:19:45] (03PS3) 10Tarrow: Docs: Add information on updating a chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/563412 [16:21:16] 10Operations, 10Traffic: Analyze the impact of removing TLSv1/v1.1 on puppetmasters - https://phabricator.wikimedia.org/T242991 (10Vgutierrez) [16:21:22] (03CR) 10Bstorm: kubernetes: ingress: introduce annotation to redirect the webapp root (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565259 (https://phabricator.wikimedia.org/T242719) (owner: 10Arturo Borrero Gonzalez) [16:21:28] 10Operations, 10Traffic: Analyze the impact of removing TLSv1/v1.1 on puppetmasters - https://phabricator.wikimedia.org/T242991 (10Vgutierrez) p:05Triage→03Normal [16:22:27] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission alnitak.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242990 (10Jgreen) [16:23:38] (03PS1) 10Tarrow: Use termbox 0.0.4 chart on test deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/565315 [16:23:50] PROBLEM - cassandra-a CQL 10.192.16.153:9042 on restbase2021 is CRITICAL: connect to address 10.192.16.153 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:24:02] (03PS1) 10Vgutierrez: ncredir: Remove TLSv1.0 && TLSv1.1 support [puppet] - 10https://gerrit.wikimedia.org/r/565316 (https://phabricator.wikimedia.org/T238518) [16:25:25] (03CR) 10Tarrow: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/565315 (owner: 10Tarrow) [16:26:09] (03CR) 10Jakob: [C: 03+2] Use termbox 0.0.4 chart on test deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/565315 (owner: 10Tarrow) [16:26:12] PROBLEM - cassandra-a CQL 10.192.32.191:9042 on restbase2022 is CRITICAL: connect to address 10.192.32.191 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:26:14] PROBLEM - cassandra-a SSL 10.192.16.153:7001 on restbase2021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [16:26:59] (03CR) 10Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1001/20397/" [puppet] - 10https://gerrit.wikimedia.org/r/565316 (https://phabricator.wikimedia.org/T238518) (owner: 10Vgutierrez) [16:27:10] (03CR) 10Vgutierrez: [C: 03+2] ncredir: Remove TLSv1.0 && TLSv1.1 support [puppet] - 10https://gerrit.wikimedia.org/r/565316 (https://phabricator.wikimedia.org/T238518) (owner: 10Vgutierrez) [16:27:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs-dns-floating-ip-updater.py: Partial refactor [puppet] - 10https://gerrit.wikimedia.org/r/565284 (https://phabricator.wikimedia.org/T238766) (owner: 10Andrew Bogott) [16:28:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs-dns-floating-ip-updater.py: further refactor [puppet] - 10https://gerrit.wikimedia.org/r/565285 (https://phabricator.wikimedia.org/T238766) (owner: 10Andrew Bogott) [16:28:32] PROBLEM - cassandra-a CQL 10.192.48.142:9042 on restbase2023 is CRITICAL: connect to address 10.192.48.142 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:28:32] PROBLEM - cassandra-a SSL 10.192.32.191:7001 on restbase2022 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [16:28:32] PROBLEM - cassandra-a service on restbase2021 is CRITICAL: CRITICAL - Expecting active but 
unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:28:46] (03PS2) 10Jakob: Use termbox 0.0.4 chart on test deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/565315 (owner: 10Tarrow) [16:29:14] (03CR) 10Jakob: [C: 03+2] Use termbox 0.0.4 chart on test deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/565315 (owner: 10Tarrow) [16:29:16] (03CR) 10Dzahn: "should i spread them out across rows (re: comments on ticket) ?" [puppet] - 10https://gerrit.wikimedia.org/r/564175 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [16:29:34] those cassandra issues are expected? [16:29:38] (03Merged) 10jenkins-bot: Use termbox 0.0.4 chart on test deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/565315 (owner: 10Tarrow) [16:30:21] (03CR) 10Jgreen: [C: 03+2] remove DNS records for frdb2001 and alnitak [dns] - 10https://gerrit.wikimedia.org/r/565313 (https://phabricator.wikimedia.org/T242983) (owner: 10Jgreen) [16:30:50] vgutierrez: yeah that's me [16:30:52] !log jakob@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'test' . 
[16:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:55] downtimed [16:30:55] (03PS1) 10CDanis: grafana1001 to spare in prep for decom [puppet] - 10https://gerrit.wikimedia.org/r/565323 (https://phabricator.wikimedia.org/T242992) [16:30:58] thx <3 [16:31:19] (03CR) 10Muehlenhoff: [C: 03+1] grafana1001 to spare in prep for decom [puppet] - 10https://gerrit.wikimedia.org/r/565323 (https://phabricator.wikimedia.org/T242992) (owner: 10CDanis) [16:31:44] !log depooling wdqs1007, blazegraph stuck (T242453) [16:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:47] T242453: wdqs1005 stopped to handle updates - https://phabricator.wikimedia.org/T242453 [16:38:44] (03CR) 10CDanis: [C: 03+2] grafana1001 to spare in prep for decom [puppet] - 10https://gerrit.wikimedia.org/r/565323 (https://phabricator.wikimedia.org/T242992) (owner: 10CDanis) [16:49:23] (03PS1) 10Tarrow: Use termbox 0.0.4 chart on staging deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/565342 [16:50:01] (03CR) 10Tarrow: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/565342 (owner: 10Tarrow) [16:50:25] (03CR) 10Jakob: [C: 03+2] Use termbox 0.0.4 chart on staging deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/565342 (owner: 10Tarrow) [16:50:42] (03Merged) 10jenkins-bot: Use termbox 0.0.4 chart on staging deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/565342 (owner: 10Tarrow) [16:51:00] (03PS1) 10Tarrow: Use termbox 0.0.4 chart on codfw prod deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/565344 [16:51:13] (03CR) 10Tarrow: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/565344 (owner: 10Tarrow) [16:51:30] !log jakob@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'staging' . 
[16:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:25] (03PS1) 10Tarrow: Use termbox 0.0.4 chart on eqiad prod deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/565346 [16:52:35] (03CR) 10Jakob: [C: 03+2] Use termbox 0.0.4 chart on codfw prod deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/565344 (owner: 10Tarrow) [16:52:40] (03CR) 10Tarrow: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/565346 (owner: 10Tarrow) [16:52:52] (03Merged) 10jenkins-bot: Use termbox 0.0.4 chart on codfw prod deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/565344 (owner: 10Tarrow) [16:53:42] !log jakob@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'termbox' for release 'production' . [16:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:27] (03CR) 10Jakob: [C: 03+2] Use termbox 0.0.4 chart on eqiad prod deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/565346 (owner: 10Tarrow) [16:58:44] (03Merged) 10jenkins-bot: Use termbox 0.0.4 chart on eqiad prod deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/565346 (owner: 10Tarrow) [17:00:04] godog and _joe_: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200116T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:02:55] !log jakob@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'termbox' for release 'production' . 
[17:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:42] !log restarting blazegraph@wdqs1007 (T242453) [17:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:45] T242453: wdqs1005 stopped to handle updates - https://phabricator.wikimedia.org/T242453 [17:07:32] (03PS2) 10CRusnov: puppetdb: Improve structure and separate VMs and devices [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/565176 [17:08:16] (03PS4) 10Tarrow: Docs: Add information on updating a chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/563412 [17:09:02] (03CR) 10Tarrow: [C: 03+2] "as per Alex; rebase needed because ffwd only" [deployment-charts] - 10https://gerrit.wikimedia.org/r/563412 (owner: 10Tarrow) [17:09:19] (03Merged) 10jenkins-bot: Docs: Add information on updating a chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/563412 (owner: 10Tarrow) [17:14:31] (03CR) 10Dzahn: [C: 03+2] ganeti: Deprecate makevm.sh [puppet] - 10https://gerrit.wikimedia.org/r/565231 (owner: 10Alexandros Kosiaris) [17:21:12] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frlog2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242265 (10Jgreen) Note we have the same Debian vs unused 10G NIC problem documented here T242481. We're waiting for information from Dell on how to disable the 10G ports in B... [17:27:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] kubernetes: ingress: introduce annotation to redirect the webapp root (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565259 (https://phabricator.wikimedia.org/T242719) (owner: 10Arturo Borrero Gonzalez) [17:28:45] (03CR) 10Dzahn: "btw there is a file called "typos" in the root of operations/puppet. 
not 100% sure if it cares about capitalization but if it does you cou" [puppet] - 10https://gerrit.wikimedia.org/r/565166 (owner: 10Legoktm) [17:38:58] jouncebot: now [17:38:58] For the next 0 hour(s) and 21 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200116T1700) [17:39:26] (03PS1) 10Jforrester: [trwiki] Change logo to reflect unblocking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565350 (https://phabricator.wikimedia.org/T242932) [17:39:35] !log Upgrade parsoid to php 7.2.26 and restart - T241222 [17:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:39] T241222: Update Wikimedia production to PHP 7.2.26 - https://phabricator.wikimedia.org/T241222 [17:40:07] I'm going to deploy the trwiki logo changes. [17:40:23] (03CR) 10Jforrester: [C: 03+2] [trwiki] Change logo to reflect unblocking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565350 (https://phabricator.wikimedia.org/T242932) (owner: 10Jforrester) [17:41:14] (03Merged) 10jenkins-bot: [trwiki] Change logo to reflect unblocking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565350 (https://phabricator.wikimedia.org/T242932) (owner: 10Jforrester) [17:42:47] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: Capture shutdown/destruct backtrace in php7-fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/559262 (https://phabricator.wikimedia.org/T241097) (owner: 10Krinkle) [17:46:17] !log jforrester@deploy1001 Synchronized static/images/project-logos/trwiki.png: [trwiki] Change logo to reflect unblocking, 1x T242977 (duration: 00m 56s) [17:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:20] T242977: Change Turkish Wikipedia logo to reflect occasion of unblocking - https://phabricator.wikimedia.org/T242977 [17:47:26] !log jforrester@deploy1001 Synchronized static/images/project-logos/trwiki-1.5x.png: [trwiki] Change logo to reflect unblocking, 1.5x
T242977 (duration: 00m 55s) [17:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:43] \o/ [17:47:56] (03CR) 10Effie Mouzeli: [C: 03+1] "> should i spread them out across rows (re: comments on ticket) ?" [puppet] - 10https://gerrit.wikimedia.org/r/564175 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [17:48:11] and it's live already! [17:48:31] !log jforrester@deploy1001 Synchronized static/images/project-logos/trwiki-2x.png: [trwiki] Change logo to reflect unblocking, 2x T242977 (duration: 00m 56s) [17:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:49] !log Manually purged the trwiki logos from Varnish as part of updating them to reflect unblocking, T242977 [17:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:27] (03CR) 10Cicalese: [C: 03+1] Echo: remove transition echo seen-time storage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565120 (https://phabricator.wikimedia.org/T234963) (owner: 10Eevans) [17:55:24] 10Operations, 10ops-codfw, 10Core Platform Team: Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 (10Eevans) [17:55:52] 10Operations, 10ops-codfw, 10Core Platform Team Workboards (Clinic Duty Team): Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 (10Eevans) [17:56:21] 10Operations, 10ops-codfw, 10Core Platform Team Workboards (Clinic Duty Team): Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 (10Eevans) p:05Triage→03Normal [17:57:30] (03CR) 10Dzahn: "yea, it would work. 
the check command is "git grep -I -n -P -f typos" (-i would ignore case but is not included)" [puppet] - 10https://gerrit.wikimedia.org/r/565166 (owner: 10Legoktm) [17:59:43] jouncebot: now [17:59:44] For the next 0 hour(s) and 0 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200116T1700) [18:00:02] RECOVERY - cassandra-a service on restbase2021 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:00:05] cscott, arlolra, subbu, halfak, and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200116T1800). [18:00:10] RECOVERY - cassandra-a SSL 10.192.16.153:7001 on restbase2021 is OK: SSL OK - Certificate restbase2021-a valid until 2022-01-15 15:52:56 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [18:01:10] !log bootstrapping restbase2021-a — T243000 [18:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:14] T243000: Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 [18:03:48] 10Operations, 10netops, 10cloud-services-team (Kanban): WMCS: cleanup network allocations - https://phabricator.wikimedia.org/T240670 (10aborrero) 05Resolved→03Open a:05aborrero→03ayounsi @ayounsi I detected that switches still these vlan definitions in them. Please cleanup them when you have a momen... 
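For reference, the case-sensitivity point in the 17:57:30 comment (the check runs `git grep -I -n -P -f typos` without `-i`) can be sketched in Python. This is a rough illustration only: the pattern and sample lines are hypothetical, since the real typos file is not shown in the log, and Python's `re` module stands in for `git grep -P`.

```python
import re

# Hypothetical typo pattern and sample lines; not the contents of the real typos file.
TYPO_PATTERN = r"Mediawiki"
SAMPLE_LINES = [
    "class mediawiki::web",     # lowercase identifier
    "Install Mediawiki here",   # miscapitalized name
    "Install MediaWiki here",   # correct capitalization
]


def find_typos(lines, pattern, ignore_case=False):
    """Return the lines matching the pattern, optionally ignoring case."""
    flags = re.IGNORECASE if ignore_case else 0
    rx = re.compile(pattern, flags)
    return [line for line in lines if rx.search(line)]


# Without -i (the behaviour described in the comment): only the exact-case typo is flagged.
case_sensitive = find_typos(SAMPLE_LINES, TYPO_PATTERN)

# With -i: every capitalization is flagged, including the correct one.
case_insensitive = find_typos(SAMPLE_LINES, TYPO_PATTERN, ignore_case=True)

print(case_sensitive)
print(case_insensitive)
```

Under these assumptions, a case-sensitive pattern is what lets a capitalization rule flag only the wrong spelling.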
[18:10:27] 10Operations, 10netops, 10cloud-services-team (Kanban): asw-b-codfw: fixes for openstack - https://phabricator.wikimedia.org/T243002 (10aborrero) [18:14:32] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 37699456 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:16:47] (03CR) 10Dzahn: Consistently capitalize MediaWiki properly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/565166 (owner: 10Legoktm) [18:17:37] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1792 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:20:19] (03PS1) 10CDanis: puppet-merge: clean up extraneous newlines in output [puppet] - 10https://gerrit.wikimedia.org/r/565359 [18:23:31] 10Operations, 10ops-codfw, 10Core Platform Team Workboards (Clinic Duty Team): (No Need By Date Provided) rack/setup/install restbase202[123] - https://phabricator.wikimedia.org/T241790 (10Eevans) [18:23:45] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (No Need By Date) rack/setup/install restbase1029, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10Eevans) [18:25:48] (03CR) 10Ottomata: [C: 04-1] "We won't be able to do this until Mediawiki can do HTTPS requests :p" [puppet] - 10https://gerrit.wikimedia.org/r/562792 (https://phabricator.wikimedia.org/T241073) (owner: 10Alexandros Kosiaris) [18:26:17] PROBLEM - MariaDB Slave Lag: s8 on dbstore1005 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 523.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [18:28:45] RECOVERY - MariaDB Slave Lag: s8 on dbstore1005 is OK: OK slave_sql_lag Replication lag: 0.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [18:38:25] 10Operations, 10Graphoid, 10serviceops, 10Core 
Platform Team Workboards (Clinic Duty Team): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10CCicalese_WMF) @WDoranWMF Is there work for CPT here past changing the RESTBase configuration? [18:39:43] (03PS2) 10Bstorm: k8s: Don't restart all k8s machinery to reboot a basic webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/563624 (https://phabricator.wikimedia.org/T236202) [18:49:25] (03CR) 10Bstorm: kubernetes: ingress: introduce annotation to redirect the webapp root (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565259 (https://phabricator.wikimedia.org/T242719) (owner: 10Arturo Borrero Gonzalez) [18:51:37] (03CR) 10Bstorm: [C: 03+2] kubernetes: ingress: introduce annotation to redirect the webapp root [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565259 (https://phabricator.wikimedia.org/T242719) (owner: 10Arturo Borrero Gonzalez) [18:54:00] !log arlolra@deploy1001 Started deploy [parsoid/deploy@7bf9819]: Updating Parsoid to 02f0066 [18:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:39] (03CR) 10Dzahn: "Mark or Faidon, could you approve this root access?" [puppet] - 10https://gerrit.wikimedia.org/r/564171 (https://phabricator.wikimedia.org/T242309) (owner: 10Dzahn) [18:57:13] (03CR) 10Bstorm: k8s: Don't restart all k8s machinery to reboot a basic webservice (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/563624 (https://phabricator.wikimedia.org/T236202) (owner: 10Bstorm) [18:57:38] (03PS3) 10Bstorm: k8s: Don't restart all k8s machinery to reboot a basic webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/563624 (https://phabricator.wikimedia.org/T236202) [19:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Morning SWAT(Max 6 patches). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200116T1900). [19:00:05] RoanKattouw and urandom: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:10] (03PS4) 10Bstorm: k8s: Don't restart all k8s machinery to reboot a basic webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/563624 (https://phabricator.wikimedia.org/T236202) [19:00:12] o/ [19:00:18] I'll SWAT [19:00:35] RoanKattouw: cool, my patch is to undo the echoseen multi-write [19:00:44] * RoanKattouw raises eyebrow [19:00:50] RoanKattouw: I kinda wanted you to know/see that first anyways... [19:01:12] Oh, I see [19:01:14] RoanKattouw: seen-time apocalypse is upon us [19:01:29] Undo the multi-write in the sense that we'll be exclusively using the new backend (Kask) [19:01:34] right [19:01:42] the 404 is pretty low at this point [19:01:47] OK, sounds good. For a minute I thought you were talking about rolling back to the previous situation [19:01:47] 404 rate, that is [19:01:53] no no :) [19:02:04] (03CR) 10Dzahn: [C: 03+1] "lgtm. noted there are 2 different keys on the ticket but this is the later one and matches. UID and LDAP user also matching.
has Nuria app" [puppet] - 10https://gerrit.wikimedia.org/r/565304 (https://phabricator.wikimedia.org/T242807) (owner: 10Muehlenhoff) [19:02:13] (03PS3) 10Catrope: Echo: remove transition echo seen-time storage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565120 (https://phabricator.wikimedia.org/T234963) (owner: 10Eevans) [19:02:20] (03CR) 10Catrope: [C: 03+2] Echo: remove transition echo seen-time storage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565120 (https://phabricator.wikimedia.org/T234963) (owner: 10Eevans) [19:02:21] SRE wants to bring down the redis cluster soon, so I need to get echo and sessions off it [19:02:31] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@7bf9819]: Updating Parsoid to 02f0066 (duration: 08m 30s) [19:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:04] just fyi, I may need to roll that back :/ [19:03:22] (03Merged) 10jenkins-bot: Echo: remove transition echo seen-time storage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565120 (https://phabricator.wikimedia.org/T234963) (owner: 10Eevans) [19:05:39] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Echo: switch entirely to Kask, remove Redis fallback (T234963) (duration: 00m 56s) [19:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:43] T234963: Deploy final configuration - https://phabricator.wikimedia.org/T234963 [19:06:00] (03PS5) 10Mstyles: [cirrus] A/B test for MLR models [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559614 (https://phabricator.wikimedia.org/T219534) [19:08:16] !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Remove kask-echoseen-transition definition, now unused (T234963) (duration: 01m 35s) [19:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:39] !log arlolra@deploy1001 Started deploy [parsoid/deploy@7bf9819]: (no justification provided) [19:08:40] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:27] urandom: OK it's live, how are things looking? [19:11:47] live fleetwide? [19:13:25] yup [19:13:47] RoanKattouw: everything looks fine, I think, but there wouldn't be much to see [19:14:00] I mean, traffic on the redis cluster is too large to really see a drop [19:14:20] but nothing changed on the new store (which is good) [19:14:52] No errors [19:14:58] LGTM [19:15:52] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@7bf9819]: (no justification provided) (duration: 07m 13s) [19:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:57] OK cool [19:15:59] Thanks urandom [19:16:07] thank you! [19:16:17] (03PS3) 10Catrope: GrowthExperiments: Enable topic search, behind a hidden preference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564183 (https://phabricator.wikimedia.org/T242698) [19:16:49] (03CR) 10Catrope: [C: 03+2] GrowthExperiments: Enable topic search, behind a hidden preference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564183 (https://phabricator.wikimedia.org/T242698) (owner: 10Catrope) [19:17:49] (03Merged) 10jenkins-bot: GrowthExperiments: Enable topic search, behind a hidden preference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564183 (https://phabricator.wikimedia.org/T242698) (owner: 10Catrope) [19:19:13] 10Operations, 10ops-eqiad, 10SRE-swift-storage: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T242511 (10Cmjohnson) a new disk has been sent [19:20:13] (03CR) 10Will Doran: "As Hugh Nowlan's manager I approve his having this access provided it is approved by the appropriate SRE folks." 
[puppet] - 10https://gerrit.wikimedia.org/r/564171 (https://phabricator.wikimedia.org/T242309) (owner: 10Dzahn) [19:26:55] (03CR) 10Dzahn: [C: 03+2] profile::aptrepo::wikimedia: Use types for arguments [puppet] - 10https://gerrit.wikimedia.org/r/565251 (owner: 10Muehlenhoff) [19:33:19] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: GrowthExperiments: Enable topic search, behind a hidden preference (T242698) (duration: 00m 56s) [19:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:24] T242698: Newcomer tasks: hidden preference - https://phabricator.wikimedia.org/T242698 [19:44:28] (03CR) 10Dzahn: [C: 03+2] Create separate role for repository servers [puppet] - 10https://gerrit.wikimedia.org/r/565249 (https://phabricator.wikimedia.org/T224576) (owner: 10Muehlenhoff) [19:44:37] (03PS2) 10Dzahn: Create separate role for repository servers [puppet] - 10https://gerrit.wikimedia.org/r/565249 (https://phabricator.wikimedia.org/T224576) (owner: 10Muehlenhoff) [19:52:31] 10Operations, 10DC-Ops, 10hardware-requests, 10Continuous-Integration-Infrastructure (phase-out-jessie): Replacement hardware for buster/stretch upgrade of contint1001 and contint2001 - https://phabricator.wikimedia.org/T239880 (10thcipriani) 05Open→03Invalid >>! In T239880#5803313, @RobH wrote: > So t... [19:52:34] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10thcipriani) [19:53:14] RoanKattouw: still swatting, or can I do a patch? 
[19:53:49] Urbanecm: I'm done, it's all yours [19:54:01] thanks [19:54:30] (03CR) 10Urbanecm: [C: 03+2] Fix mistakes in HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565268 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [19:55:02] (03CR) 10Urbanecm: [C: 03+2] Add logos to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565269 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [19:55:25] (03Merged) 10jenkins-bot: Fix mistakes in HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565268 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [19:55:39] (03PS3) 10Urbanecm: Add logos to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565269 (https://phabricator.wikimedia.org/T150618) [19:55:45] (03CR) 10Urbanecm: [C: 03+2] Add logos to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565269 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [19:56:14] (03CR) 10Dzahn: [C: 03+2] codesearch: Migrate ./write_config.py cron job to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/564857 (https://phabricator.wikimedia.org/T242319) (owner: 10Legoktm) [19:56:40] (03Merged) 10jenkins-bot: Add logos to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565269 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [19:57:07] RECOVERY - cassandra-a CQL 10.192.16.153:9042 on restbase2021 is OK: TCP OK - 0.036 second response time on 10.192.16.153 port 9042 https://phabricator.wikimedia.org/T93886 [19:57:10] (03CR) 10Dzahn: "don't forget you will have to manually remove the existing cron job. 
puppet will not do that unless you keep the resource with "ensure => " [puppet] - 10https://gerrit.wikimedia.org/r/564857 (https://phabricator.wikimedia.org/T242319) (owner: 10Legoktm) [19:58:12] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: SWAT: b558eea: Fix mistakes in HD logos (T150618) (duration: 00m 56s) [19:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:16] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618 [20:00:04] liw and brennen: I, the Bot under the Fountain, allow thee, The Deployer, to do Mediawiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200116T2000). [20:00:06] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 5a32bde: Add logos to IS.php (T150618) (duration: 00m 56s) [20:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:00] !log Purge 12 logos URLs (T150618) [20:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:15] * Urbanecm is done [20:04:02] (03CR) 10Dzahn: "we should start to use puppet-compiler for these before merging. 
unfortunately they don't know about codesearch5 yet, only codesearch4" [puppet] - 10https://gerrit.wikimedia.org/r/564858 (https://phabricator.wikimedia.org/T242319) (owner: 10Legoktm) [20:11:20] (03PS2) 10Dzahn: gerrit: assign host gerrit-test role::gerrit [puppet] - 10https://gerrit.wikimedia.org/r/562587 (https://phabricator.wikimedia.org/T239151) (owner: 10Herron) [20:13:47] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 246710744 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:13:53] !log bootstrapping restbase2021-b — T243000 [20:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:56] T243000: Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 [20:15:33] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 64 and 64 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:17:06] mutante: is there something I can do to make puppet-compiler to know about the new codesearch instance? [20:19:19] (03CR) 10Legoktm: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/564857 (https://phabricator.wikimedia.org/T242319) (owner: 10Legoktm) [20:23:12] !log mforns@deploy1001 Started deploy [analytics/refinery@26a587a]: deploying analytics-refinery to accompany refinery-source v0.0.112 [20:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:15] legoktm: no, unfortunately not. unless you first get access to puppetmasters and the puppet-compiler hosts in cloud. [20:24:15] https://tools.wmflabs.org/openstack-browser/project/puppet-diffs looks like you have access? 
[20:24:37] (03CR) 10Dzahn: [C: 03+2] codesearch: Generate hound-${name} systemd units [puppet] - 10https://gerrit.wikimedia.org/r/564858 (https://phabricator.wikimedia.org/T242319) (owner: 10Legoktm) [20:24:45] (03PS4) 10Dzahn: codesearch: Generate hound-${name} systemd units [puppet] - 10https://gerrit.wikimedia.org/r/564858 (https://phabricator.wikimedia.org/T242319) (owner: 10Legoktm) [20:24:49] i'll just merge it.. nothing that can break [20:24:57] besides puppet on a new instance [20:29:26] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.15/extensions/CentralAuth/includes/GlobalRename/GlobalRenameBlacklist.php: Special:GlobalRenameRequest: Initialize blacklist even if empty T242974 (duration: 00m 57s) [20:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:30] T242974: PHP Warning: Invalid argument supplied for foreach() - https://phabricator.wikimedia.org/T242974 [20:31:54] mutante: the puppet variables don't seem to have worked in the .service file: https://paste.centos.org/view/00a1595a [20:32:38] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/565359 (owner: 10CDanis) [20:32:48] (03CR) 10BryanDavis: kubernetes: ingress: introduce annotation to redirect the webapp root (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565259 (https://phabricator.wikimedia.org/T242719) (owner: 10Arturo Borrero Gonzalez) [20:35:35] (03CR) 10CDanis: [C: 03+2] puppet-merge: clean up extraneous newlines in output [puppet] - 10https://gerrit.wikimedia.org/r/565359 (owner: 10CDanis) [20:37:18] !log mforns@deploy1001 Finished deploy [analytics/refinery@26a587a]: deploying analytics-refinery to accompany refinery-source v0.0.112 (duration: 14m 06s) [20:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:02] (03PS1) 10CDanis: empty commit: no-op just for testing puppet-merge change [puppet] - 10https://gerrit.wikimedia.org/r/565386 [20:39:16] 
(03CR) 10CDanis: [C: 03+2] empty commit: no-op just for testing puppet-merge change [puppet] - 10https://gerrit.wikimedia.org/r/565386 (owner: 10CDanis) [20:39:54] I've read through https://puppet.com/docs/puppet/latest/lang_template_erb.html#concept-5365 but I don't see where I've gone wrong [20:40:20] legoktm the => [20:40:29] !log mforns@deploy1001 Started deploy [analytics/refinery@26a587a] (thin): deploying analytics-refinery to accompany refinery-source v0.0.112 [20:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:31] i don't see it either but there must be a typo [20:40:32] needs to be <%= @var %> [20:40:36] !log mforns@deploy1001 Finished deploy [analytics/refinery@26a587a] (thin): deploying analytics-refinery to accompany refinery-source v0.0.112 (duration: 00m 07s) [20:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:38] what paladox says :) [20:41:01] ohhh [20:41:52] yep, i kept staring at it but didn't see it, thanks paladox [20:41:59] it's in line 2 [20:42:01] yw :) [20:42:36] (03PS1) 10Legoktm: codesearch: Fix syntax in hound.service.erb [puppet] - 10https://gerrit.wikimedia.org/r/565387 [20:42:52] thanks paladox :) [20:43:09] (03CR) 10Paladox: [C: 03+1] codesearch: Fix syntax in hound.service.erb [puppet] - 10https://gerrit.wikimedia.org/r/565387 (owner: 10Legoktm) [20:43:11] yw :) [20:43:39] (03CR) 10Dzahn: [C: 03+2] codesearch: Fix syntax in hound.service.erb [puppet] - 10https://gerrit.wikimedia.org/r/565387 (owner: 10Legoktm) [20:47:20] legoktm@codesearch5:~$ sudo docker ps [20:47:21] Segmentation fault [20:47:42] niiiiiice [20:47:52] I think it's all working [20:48:11] it just ran out of memory cold starting up all of codesearch at once [20:53:12] mutante: thank you for all your help, we now have a fully puppetized codesearch \o/ [20:53:19] \o/ [20:53:28] legoktm: yay:) [20:53:48] now we can move it to prod ?:p [20:54:52] i mean.. 
making a ganeti VM for it could theoretically happen [20:55:47] I think the main blocker is that we're currently using upstream's docker image and I think we'd want to build our own docker images for it [20:56:22] hmm. right [20:56:37] legoktm@codesearch5:~$ sudo docker ps | grep extensions [20:56:37] 0bad1df7a0e0 etsy/hound "/go/bin/houndd -con…" 9 minutes ago Up 292 years hound-extensions [20:58:08] 292 years ?:o [20:58:55] seems to be very broken [20:59:17] ye older dockere [20:59:18] I think I need to move this to a bigger instance size now [21:06:13] (03CR) 10Dzahn: [C: 04-1] "this can be compiled now https://puppet-compiler.wmflabs.org/compiler1001/20401/gerrit-test.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/562587 (https://phabricator.wikimedia.org/T239151) (owner: 10Herron) [21:21:45] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frdb1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) [21:22:11] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frdb1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) [21:24:26] 10Operations, 10DC-Ops, 10decommission: decommission frdb2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242983 (10Jgreen) a:05Jgreen→03Papaul [21:30:18] 10Operations, 10ops-codfw, 10Core Platform Team Workboards (Clinic Duty Team): Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 (10Eevans) [21:32:35] (03PS6) 10Ottomata: [POC] eventgate - Use service.name as primary resource grouping, not wmf.releasename [deployment-charts] - 10https://gerrit.wikimedia.org/r/564052 [21:35:33] (03PS1) 10Dzahn: install_server: rename gerrit-test to gerrit1002 [puppet] - 10https://gerrit.wikimedia.org/r/565392 (https://phabricator.wikimedia.org/T239151) [21:40:15] (03CR) 10Bstorm: kubernetes: ingress: introduce annotation to redirect the webapp root (031 comment) [software/tools-webservice] - 
10https://gerrit.wikimedia.org/r/565259 (https://phabricator.wikimedia.org/T242719) (owner: 10Arturo Borrero Gonzalez) [21:41:10] (03CR) 10Dzahn: "gerrit-test is actually the service name and each gerrit server has 2 (4) IPs. so this needed to be gerrit1002 and then the service on it " [puppet] - 10https://gerrit.wikimedia.org/r/562575 (https://phabricator.wikimedia.org/T239151) (owner: 10Herron) [21:41:59] (03CR) 10Dzahn: [C: 03+2] install_server: rename gerrit-test to gerrit1002 [puppet] - 10https://gerrit.wikimedia.org/r/565392 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [21:42:49] (03CR) 10Bstorm: "I missed that this didn't have a changelog, so I've included it on this https://gerrit.wikimedia.org/r/c/operations/software/tools-webserv" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565259 (https://phabricator.wikimedia.org/T242719) (owner: 10Arturo Borrero Gonzalez) [21:48:11] 10Operations, 10TechCom-RFC, 10Traffic, 10Core Platform Team Legacy (Designing), 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906 (10chasemp) with no meaningful upda... 
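A side note on the "Up 292 years" shown by `docker ps` at 20:56:37: the log does not say why the uptime was wrong, but 292 years is roughly the largest duration that fits in a signed 64-bit count of nanoseconds, so one plausible reading (an assumption, not something confirmed in the channel) is an overflowed or saturated duration value. The arithmetic can be checked with a short sketch:

```python
# Largest value of a signed 64-bit integer, read here as a nanosecond count.
INT64_MAX_NS = 2**63 - 1

# A 365-day year is assumed; a 365.25-day year gives the same whole number of years.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

years = INT64_MAX_NS / 1e9 / SECONDS_PER_YEAR
print(int(years))
```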
[21:53:17] 10Operations: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) [21:53:35] 10Operations: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) [21:53:55] 10Operations, 10Gerrit: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) [21:54:04] 10Operations, 10Gerrit: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) p:05Triage→03Normal [21:54:37] 10Puppet, 10VPS-project-codesearch: Puppetize codesearch - https://phabricator.wikimedia.org/T242319 (10Dzahn) sounds like it's resolved already :) [21:57:56] (03PS3) 10Dzahn: gerrit: assign host gerrit1002 role::gerrit [puppet] - 10https://gerrit.wikimedia.org/r/562587 (https://phabricator.wikimedia.org/T239151) (owner: 10Herron) [21:59:23] (03PS1) 10Dzahn: site: replace gerrit-test with gerrit1002 [puppet] - 10https://gerrit.wikimedia.org/r/565395 (https://phabricator.wikimedia.org/T239151) [22:00:22] (03CR) 10Dzahn: [C: 03+2] site: replace gerrit-test with gerrit1002 [puppet] - 10https://gerrit.wikimedia.org/r/565395 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [22:12:27] Jan 16 22:11:16 codesearch6 write_config.py[24165]: with open(os.path.join(directory, 'config.json'), 'w') as f: [22:12:28] Jan 16 22:11:16 codesearch6 write_config.py[24165]: PermissionError: [Errno 13] Permission denied: '/srv/hound/hound-search/config.json' [22:12:57] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 80250944 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:12:59] hm [22:16:41] fixed. 
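The `PermissionError` pasted at 22:12:28 comes from opening `config.json` for writing in a directory the service user cannot write to. The log does not record how it was fixed. As a minimal sketch, assuming a hypothetical helper named `can_write_config` (it is not part of the real `write_config.py`), a pre-flight check along these lines would surface the problem before the `open()` call:

```python
import os
import tempfile


def can_write_config(directory, filename="config.json"):
    """Return True if the current user could create or overwrite the file.

    Hypothetical helper for illustration; not taken from write_config.py.
    """
    path = os.path.join(directory, filename)
    if os.path.exists(path):
        # Existing file: it must itself be writable.
        return os.access(path, os.W_OK)
    # New file: the directory must exist and allow creating entries.
    return os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK)


# A fresh temporary directory is writable by the user that created it.
with tempfile.TemporaryDirectory() as scratch:
    writable = can_write_config(scratch)

# A directory that does not exist can never be written to.
missing = can_write_config(
    os.path.join(tempfile.gettempdir(), "no-such-hound-dir-for-demo")
)

print(writable, missing)
```

`os.access` only reports what the current user may do at that moment, so the actual `open()` can still fail; the check is a diagnostic aid, not a substitute for fixing ownership or mode on the directory.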
[22:21:37] 10Operations, 10ops-codfw, 10Core Platform Team Workboards (Clinic Duty Team): Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 (10Eevans) [22:21:58] !log bootstrapping restbase2021-c — T243000 [22:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:05] T243000: Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 [22:22:37] (03PS1) 10Catrope: GrowthExperiments: Enable suggested edits everywhere (except euwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565398 [22:23:19] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 57360 and 487 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:28:09] (03PS1) 10Dzahn: add IPs for gerrit1002 in row C [dns] - 10https://gerrit.wikimedia.org/r/565399 (https://phabricator.wikimedia.org/T239151) [22:28:18] (03CR) 10jerkins-bot: [V: 04-1] add IPs for gerrit1002 in row C [dns] - 10https://gerrit.wikimedia.org/r/565399 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [22:29:09] (03PS2) 10Dzahn: add IPs for gerrit1002 in row C [dns] - 10https://gerrit.wikimedia.org/r/565399 (https://phabricator.wikimedia.org/T239151) [22:29:28] (03CR) 10jerkins-bot: [V: 04-1] add IPs for gerrit1002 in row C [dns] - 10https://gerrit.wikimedia.org/r/565399 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [22:34:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [22:34:55] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [22:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:37] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [22:37:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:48] !log dzahn@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) [22:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:41] !log ganeti1003 - deleting VM gerrit-test (T239151) [22:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:43] T239151: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 [22:41:50] (03PS1) 10Jforrester: [trwiki] Tweak unblocking logo versions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565401 (https://phabricator.wikimedia.org/T242977) [22:47:46] (03CR) 10BryanDavis: [C: 03+1] k8s: Don't restart all k8s machinery to reboot a basic webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/563624 (https://phabricator.wikimedia.org/T236202) (owner: 10Bstorm) [23:08:53] (03CR) 10Bstorm: [C: 03+2] k8s: Don't restart all k8s machinery to reboot a basic webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/563624 (https://phabricator.wikimedia.org/T236202) (owner: 10Bstorm) [23:11:33] (03PS1) 10Dzahn: phabricator: add warning about /srv/dumps [puppet] - 10https://gerrit.wikimedia.org/r/565403 [23:13:32] (03PS2) 10Dzahn: phabricator: add warning about /srv/dumps [puppet] - 10https://gerrit.wikimedia.org/r/565403 [23:14:41] (03PS1) 10Dzahn: gerrit: rename gerrit-test to gerrit1002 in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/565404 [23:15:49] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: d-i fails to install on servers with BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10Papaul) Update from Dell after today's meeting. They are looking for a way to see if it is possible and doable to rename the interfaces o... 
[23:17:44] (03PS1) 10Bstorm: dumps-distribution: switch to sharing out only the pertinent dir on nfs [puppet] - 10https://gerrit.wikimedia.org/r/565405 (https://phabricator.wikimedia.org/T242798) [23:21:52] (03CR) 10Paladox: [C: 03+1] phabricator: add warning about /srv/dumps [puppet] - 10https://gerrit.wikimedia.org/r/565403 (owner: 10Dzahn) [23:28:57] (03CR) 10Bstorm: "In livetesting in toolsbeta, I found to my surprise that the nfsclient setup will actually switch /etc/fstab and seemlessly remount and re" [puppet] - 10https://gerrit.wikimedia.org/r/565405 (https://phabricator.wikimedia.org/T242798) (owner: 10Bstorm) [23:30:58] (03CR) 10Bstorm: "compiler looks like what I had in mind. https://puppet-compiler.wmflabs.org/compiler1001/20402/" [puppet] - 10https://gerrit.wikimedia.org/r/565405 (https://phabricator.wikimedia.org/T242798) (owner: 10Bstorm) [23:33:58] mutante: I have never removed a running app from prod. I just created T243037 and you came to mind as someone to look over the checklist and help me find more things that need doing. [23:33:59] T243037: Shutdown scholarships.wikimedia.org and archive project - https://phabricator.wikimedia.org/T243037 [23:34:25] and maybe fill in some details about my "Remove application from Wikimedia Foundation production" item there [23:36:15] bd808: "Delete subdomain from DNS or redirect elsewhere" [23:36:24] bd808: ahh. that sounds familiar. i think i was involved in setting that up. yea. i'll look. just got my food though [23:36:35] yes, delete from DNS +1 [23:36:37] mutante: no rush! 
[23:36:57] dns is apparently going elsewhere: T243032
[23:36:57] T243032: Domain / Subdomain for Wikimania Scholarship Public Form on CRM - https://phabricator.wikimedia.org/T243032
[23:37:37] bd808: remove mysql grants from modules/role/templates/mariadb/grants/production-m2.sql.erb
[23:38:12] (PS9) Jeena Huneidi: Modify Restrouter chart to allow for minikube development [deployment-charts] - https://gerrit.wikimedia.org/r/545421 (https://phabricator.wikimedia.org/T228910)
[23:38:23] CiviCRM is back ?
[23:38:25] uhm..
[23:38:36] still remembers that in production
[23:38:38] external hosting apparently
[23:39:01] that sounds problematic
[23:39:06] pointing wikimedia.org to external
[23:39:38] don't we do that already in a bunch of cases?
[23:39:53] https://store.wikimedia.org/ etc
[23:40:14] bd808: external this year, local server next FY base don one of the tasks
[23:40:17] yea, and that is an ongoing issue since a long time ..specifically a ticket about the shop
[23:40:19] *based
[23:41:15] p858snake: I won't hold my breath. :) I think the move to internal hosting is gated on getting hiring approval for new folks to run it
[23:41:33] "Plesk configuration" wow.. that's scary
[23:41:45] once had to handle Plesk stuff.. the worst
[23:41:48] Operations, WMDE-Analytics-Engineering, Graphite, User-Addshore: Regularly & Automatically backup WMDE metrics stored in graphite - https://phabricator.wikimedia.org/T125408 (Addshore)
[23:41:59] would you like some security holes with that?\
[23:42:52] external, plesk, provide external with SSL cert, private data, system will send email as @wikimedia.org .. lots of fun in there
[23:43:10] Operations, WMDE-Analytics-Engineering, Graphite, User-Addshore: Regularly & Automatically backup WMDE metrics stored in graphite - https://phabricator.wikimedia.org/T125408 (Addshore) a:Addshore @fgiunchedi Any idea if there is any sort of regular / scheduled backups of the disks for graphit...
[23:44:39] (PS10) Jeena Huneidi: Modify Restrouter chart to allow for minikube development [deployment-charts] - https://gerrit.wikimedia.org/r/545421 (https://phabricator.wikimedia.org/T228910)
[23:45:23] (CR) Jeena Huneidi: "> Patch Set 6: Code-Review-1" (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/545421 (https://phabricator.wikimedia.org/T228910) (owner: Jeena Huneidi)
[23:48:18] Operations, Security-Team, Traffic, CRM (Jan-Mar-2020): Domain / Subdomain for Wikimania Scholarship Public Form on CRM - https://phabricator.wikimedia.org/T243032 (Dzahn)
[23:54:26] Puppet, VPS-project-codesearch: Puppetize codesearch - https://phabricator.wikimedia.org/T242319 (Legoktm) Everything seems to work, but there's a few issues when bootstrapping a new node: * The hound-* instances shouldn't start until after a successful run of `codesearch-write-config.service`. If they s...
[23:55:04] mutante: maybe suggest setting up a dedicated domain like wikimediascholarships, which is what happened with one of the mailing services from memory
[23:55:18] *of the other mailing