[00:00:49] PROBLEM - puppet last run on mw1290 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[00:13:45] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 820.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[00:16:46] Operations, ops-eqiad, DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (RobH)
[00:16:59] RECOVERY - puppet last run on mw1290 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[00:21:49] (PS1) Dzahn: trafficserver: add Icinga notes url for nrpe_monitor_script [puppet] - https://gerrit.wikimedia.org/r/521380
[00:22:21] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[00:22:51] Operations, ops-eqiad, DC-Ops: b1-eqiad pdu refresh - https://phabricator.wikimedia.org/T227536 (RobH)
[00:24:24] Operations, ops-eqiad, DC-Ops: b1-eqiad pdu refresh - https://phabricator.wikimedia.org/T227536 (RobH)
[00:33:49] (PS1) Dzahn: proxysql: add icinga notes_url [puppet] - https://gerrit.wikimedia.org/r/521382
[00:39:37] (PS1) DannyS712: Disable flaggedrevs for hewikisource main page [mediawiki-config] - https://gerrit.wikimedia.org/r/521383 (https://phabricator.wikimedia.org/T227000)
[00:40:56] (PS2) DannyS712: Disable flaggedrevs for hewikisource main page [mediawiki-config] - https://gerrit.wikimedia.org/r/521383 (https://phabricator.wikimedia.org/T227000)
[00:41:09] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[00:48:25] (PS1) Dzahn: nrpe: add notes_url parameter to spec and tests [puppet] - https://gerrit.wikimedia.org/r/521386
[00:49:05] Operations, ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (ayounsi) Resolved→Open > CRITICAL (for 9d 15h 51m 18s) > cluster=misc device=megaraid,14 instance=helium:9100 job=node site=eqiad https://icinga.wikimedia.org/cgi-bin/icinga/exti...
[00:49:15] (CR) jerkins-bot: [V: -1] nrpe: add notes_url parameter to spec and tests [puppet] - https://gerrit.wikimedia.org/r/521386 (owner: Dzahn)
[00:49:32] ACKNOWLEDGEMENT - Device not healthy -SMART- on helium is CRITICAL: cluster=misc device=megaraid,14 instance=helium:9100 job=node site=eqiad Ayounsi https://phabricator.wikimedia.org/T205364 https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=helium&var-datasource=eqiad+prometheus/ops
[00:54:14] Operations, ops-esams, Traffic: cp3037 is currently unreachable - https://phabricator.wikimedia.org/T222041 (ayounsi) >>! In T222041#5260488, @Papaul wrote: > cp3037 is dead, it needs new mainboard since it is out of warranty better decommission it. Based on that comment I changed the status of tha...
[00:57:47] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[01:06:22] (PS1) DannyS712: Clean up `wgNamespacesWithSubpages` to remove unneeded entries [mediawiki-config] - https://gerrit.wikimedia.org/r/521390 (https://phabricator.wikimedia.org/T227546)
[01:10:18] (PS2) DannyS712: Clean up `wgNamespacesWithSubpages` to remove unneeded entries [mediawiki-config] - https://gerrit.wikimedia.org/r/521390 (https://phabricator.wikimedia.org/T227546)
[01:12:51] (PS3) DannyS712: Clean up `wgNamespacesWithSubpages` to remove unneeded entries [mediawiki-config] - https://gerrit.wikimedia.org/r/521390 (https://phabricator.wikimedia.org/T227546)
[01:18:52] Operations: Host mw2250 is not in mediawiki-installation dsh group - https://phabricator.wikimedia.org/T227547 (ayounsi) p:Triage→Normal
[01:19:37] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2250 is CRITICAL: Host mw2250 is not in mediawiki-installation dsh group Ayounsi https://phabricator.wikimedia.org/T227547 https://wikitech.wikimedia.org/wiki/Application_servers%23Apache_setup_checklist
[01:35:28] !log restart PHP FPM on mwdebug1002
[01:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:49] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[01:45:29] Operations, ops-codfw: SSH to mw2269.mgmt not working - https://phabricator.wikimedia.org/T227548 (ayounsi) p:Triage→Normal
[01:45:52] ACKNOWLEDGEMENT - SSH mw2269.mgmt on mw2269.mgmt is CRITICAL: Server answer: Ayounsi https://phabricator.wikimedia.org/T227548 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:47:26] !log restart PHP FPM on mwdebug2001
[01:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:11:53] (PS1) Dzahn: ipmi: add icinga notes_url for IPMI sensor [puppet] - https://gerrit.wikimedia.org/r/521401
[02:20:41] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:25:13] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:25:39] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:43:03] (PS1) BPirkle: Specify CentralAuth session storage separately from per-wiki session storage. [mediawiki-config] - https://gerrit.wikimedia.org/r/521409 (https://phabricator.wikimedia.org/T227097)
[02:43:11] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:44:13] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[03:16:17] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:16:27] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:16:33] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:16:35] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:17:23] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:20:05] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[03:20:43] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:20:55] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:21:01] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:21:01] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:21:45] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:30:17] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[03:53:03] PROBLEM - puppet last run on mw1281 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[04:14:53] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 41.89 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:15:08] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 46.48 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:16:23] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 78.2 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:16:26] Operations, ops-eqiad, DC-Ops: b1-eqiad pdu refresh - https://phabricator.wikimedia.org/T227536 (ArielGlenn) Adding @hoo because wikidata entity dumps will be impacted.
[04:16:37] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 92.58 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:20:15] RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:42:19] (PS4) Marostegui: mariadb: Promote db1132 to m2 master [puppet] - https://gerrit.wikimedia.org/r/519975 (https://phabricator.wikimedia.org/T226952)
[04:44:26] (CR) Marostegui: mariadb: Promote db1132 to m2 master [puppet] - https://gerrit.wikimedia.org/r/519975 (https://phabricator.wikimedia.org/T226952) (owner: Marostegui)
[04:53:55] !log Reboot pc2010 to debug a memory issue
[04:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:05:58] Operations, ops-codfw, DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (Marostegui)
[05:06:12] Operations, ops-codfw, DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (Marostegui) p:Triage→Normal
[05:13:17] !log Rebooting pc2010 for a second time as per papaul's suggestion T226952
[05:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:22] T226952: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952
[05:13:47] !log Rebooting pc2010 for a second time as per papaul's suggestion T227552
[05:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:53] T227552: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552
[05:14:43] Operations, DBA, OTRS, Operations-Software-Development, and 2 others: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (Marostegui) >>! In T226952#5316025, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://tools.wmflabs.o...
[05:17:58] Operations, ops-codfw, DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (Marostegui) As per my chat with @Papaul I rebooted the host a second time and the previous error didn't show up.
[05:19:39] !log Start switchover steps T226952
[05:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:44] T226952: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952
[05:21:34] Operations, DBA, OTRS, Operations-Software-Development, and 2 others: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (jcrespo) ` $ ./replication_tree.py db1065 db1065, version: 10.1.33, up: 1y, RO: OFF, binlog: MIXED, lag: None, processes: None, latency: 0.0991 +...
[05:23:31] Operations, ops-codfw, DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (Marostegui) @Papaul and myself chatted about this and the plan is to: - Clear logs (I just did) - Upgrade firmware, BIOS etc - Leave this task open for a week to see if it happens again and if not c...
[05:27:07] (CR) Jcrespo: [C: +1] mariadb: Promote db1132 to m2 master [puppet] - https://gerrit.wikimedia.org/r/519975 (https://phabricator.wikimedia.org/T226952) (owner: Marostegui)
[05:29:44] * volans here
[05:30:58] marostegui: anything I can help with for the preparation phase?
[05:31:11] volans: nope! thanks :)
[05:38:37] (CR) Marostegui: [C: +2] mariadb: Promote db1132 to m2 master [puppet] - https://gerrit.wikimedia.org/r/519975 (https://phabricator.wikimedia.org/T226952) (owner: Marostegui)
[05:55:19] (PS1) Vgutierrez: Add ncredir-lb records [dns] - https://gerrit.wikimedia.org/r/521414 (https://phabricator.wikimedia.org/T133548)
[05:55:19] marostegui: if after the switch you can give me 30 seconds before killing existing connections I can check what debmonitor's behaviour is in this case and if it needs any manual intervention (also for future stuff)
[05:55:47] volans: sure!
[05:57:23] (PS2) Elukey: aptrepo: add missing update for amd-rocm [puppet] - https://gerrit.wikimedia.org/r/521319 (https://phabricator.wikimedia.org/T224723)
[06:00:04] marostegui and jynus: Dear deployers, time to do the m2 database master failover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190709T0600).
[06:00:07] jynus and volans ready?
[06:00:17] * volans here
[06:00:19] yes
[06:00:22] !log Failover m2 from db1065 to db1132 - T226952
[06:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:33] T226952: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952
[06:00:43] done
[06:00:54] django.db.utils.OperationalError: (1290, 'The MariaDB server is running with the --read-only option so it cannot execute this statement')[2019-07-09T06:00:41]
[06:01:03] but now it seems it recovered by itself
[06:01:07] needs proxy reload
[06:01:10] :-)
[06:01:12] it is already done
[06:01:14] and/or connections killed
[06:01:21] (CR) Elukey: [C: +2] aptrepo: add missing update for amd-rocm [puppet] - https://gerrit.wikimedia.org/r/521319 (https://phabricator.wikimedia.org/T224723) (owner: Elukey)
[06:01:23] also done :)
[06:01:25] all good without touching, already back to rw
[06:01:30] nice
[06:01:42] marostegui: what about those 30s? :D
[06:01:53] haha you said it was already good! :p
[06:01:59] ah ok
[06:02:07] so you did it after I said it was ok
[06:02:13] yep
[06:02:21] (the killing connections part)
[06:02:49] i see no connections left on db1065
[06:02:58] nice, so apparently failing to write because of RO is enough for django to retry the connection, nice to know
[06:03:11] Updating tendril...
[06:03:11] [WARNING] Old master not found on tendril server list
[06:03:11] Updating zarcillo...
[06:03:11] Zarcillo updated successfully: db1132/(none) is the new master of m2 at eqiad
[06:03:22] I am checking that on tendril
[06:03:28] and finishing the last steps
[06:04:03] btw maybe we could start a wikitech page similar to the Service Restarts one for DB Failovers :)
[06:04:28] I would like someone with an OTRS account to also verify that it is all good
[06:04:34] all is good
[06:04:36] akosiaris: maybe has one?
[06:04:36] volans: we have one
[06:04:39] ah there we go!
[06:04:42] thanks akosiaris
[06:04:45] yw
[06:05:01] akosiaris: kalimera btw
[06:05:13] jynus: link so I can add debmonitor :)
[06:05:14] most issues when they happen are persistent connection handling
[06:05:25] yeah
[06:05:42] https://wikitech.wikimedia.org/wiki/MariaDB/misc#owners,_(or_in_many_cases_just_people_that_volunteer_to_help_for_the_failover)_2
[06:05:44] marostegui: buenos días senor
[06:05:50] lol
[06:06:03] haha
[06:06:27] the hardest part is googling and copy pasting that intonated i
[06:06:34] I guess the issue with tendril complaining is because I used db1065 instead of db1065.eqiad.wmnet?
[06:06:51] no
[06:06:53] to the point that I gave up with the intonated n
[06:06:55] akosiaris: You need to get an "ñ"
[06:06:58] that should work
[06:07:10] marostegui: I will give it a look
[06:07:19] sure
[06:08:57] I will update tendril manually
[06:09:00] Operations, DBA, OTRS, Operations-Software-Development, and 2 others: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (Marostegui) This was done successfully. Read only start: 06:00:31 UTC 2019 Read only stop (and proxies reloaded): 06:00:40 UTC 2019 Total read o...
[06:09:24] (PS1) Elukey: aptrepo: remove source from amd-rocm's update config [puppet] - https://gerrit.wikimedia.org/r/521417 (https://phabricator.wikimedia.org/T224723)
[06:09:51] jynus: zarcillo was updated correctly, it was apparently only tendril
[06:10:05] strange
[06:10:09] (CR) Elukey: [C: +2] aptrepo: remove source from amd-rocm's update config [puppet] - https://gerrit.wikimedia.org/r/521417 (https://phabricator.wikimedia.org/T224723) (owner: Elukey)
[06:10:17] I think I have a bug, but on both
[06:10:33] zarcillo looks correct
[06:10:35] | m2 | eqiad | db1132 |
[06:10:35] +---------+-------+----------+
[06:10:48] yes, but maybe by pure chance
[06:11:08] there is a bug on the query
[06:11:17] haha
[06:12:16] * volans updated https://wikitech.wikimedia.org/wiki/MariaDB/misc#Current_schemas_2 with debmonitor details
[06:12:26] Thanks <3
[06:12:50] Operations, Analytics, Patch-For-Review, User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (elukey) Better now! ` root@install1002:/srv/wikimedia# reprepro --noskipold --component thirdparty/amd-rocm checkupdate buster-wikimedia Calculating pac...
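Editor's note: the 06:00:54 error and the 06:02:58 remark above suggest that a write failing with MariaDB error 1290 (server running with --read-only) was enough for the client to reconnect and reach the new master. The following is only a minimal Python sketch of that client-side pattern, not debmonitor's or Django's actual code. The names `write_with_reconnect`, `connect` and `run_write`, and the stand-in `OperationalError` class, are assumptions made for illustration.

```python
import time

# MariaDB/MySQL error code returned when the server runs with --read-only.
ER_OPTION_PREVENTS_STATEMENT = 1290


class OperationalError(Exception):
    """Stand-in for a DB driver's OperationalError; args[0] holds the error code."""


def write_with_reconnect(connect, run_write, retries=3, delay=0.0):
    """Run run_write(conn); on a read-only error, reconnect and try again.

    connect:   hypothetical callable returning a fresh connection (e.g. via the proxy).
    run_write: hypothetical callable that performs the write on that connection.
    """
    conn = connect()
    for attempt in range(retries + 1):
        try:
            return run_write(conn)
        except OperationalError as exc:
            code = exc.args[0] if exc.args else None
            # Re-raise anything that is not a read-only error, or the last failure.
            if code != ER_OPTION_PREVENTS_STATEMENT or attempt == retries:
                raise
            # The current connection points at a read-only server: drop it and
            # open a new one, which should land on the new master.
            close = getattr(conn, "close", None)
            if close:
                close()
            time.sleep(delay)
            conn = connect()
```

In this sketch a persistent connection to the old master fails once with 1290, is closed, and the retry goes through a fresh connection; any other error is passed straight to the caller.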
[06:14:06] just added also the restart procedure, just in case it might be needed
[06:14:20] thanks a lot )
[06:21:22] (PS4) Jcrespo: switchover.py: Check binary log format before switch [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/521226
[06:21:24] (PS3) Jcrespo: WMFReplication: Parallelize slaves() [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/521232
[06:21:26] (PS1) Jcrespo: switchover.py: Fix interpolation when updating tendril and zarcillo [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/521418
[06:21:35] (PS2) Jcrespo: switchover.py: Fix interpolation when updating tendril and zarcillo [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/521418
[06:23:26] (PS3) Jcrespo: switchover.py: Fix interpolation when updating tendril [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/521418
[06:23:34] (PS4) Jcrespo: switchover.py: Fix interpolation when updating tendril [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/521418
[06:23:46] jynus, marostegui thank youuu
[06:30:19] PROBLEM - puppet last run on cp5011 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
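Editor's note: the log does not show the query touched by the "Fix interpolation when updating tendril" patches, so the sketch below is not the switchover.py code. It only illustrates the general class of bug, under assumptions: a hypothetical `servers` table in an in-memory SQLite database and two hypothetical lookup functions. A query template whose placeholder is never filled matches no row, while a bound parameter finds it.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE servers (id INTEGER PRIMARY KEY, host TEXT)")
conn.execute("INSERT INTO servers (host) VALUES ('db1065')")


def find_server_buggy(host):
    # Bug: the template is never formatted, so the literal text '{}' is
    # compared against the host column and no row can match.
    query = "SELECT id FROM servers WHERE host = '{}'"
    return conn.execute(query).fetchone()


def find_server_fixed(host):
    # Fix: pass the value as a bound parameter instead of interpolating it.
    query = "SELECT id FROM servers WHERE host = ?"
    return conn.execute(query, (host,)).fetchone()
```

With the single row inserted above, the buggy lookup finds nothing for `db1065` and the fixed lookup returns its id.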
[06:30:48] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:31:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:32:35] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[06:33:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:39:03] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:39:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[06:45:39] (PS1) Elukey: aptrepo: remove packages from amd-rocm's whitelist [puppet] - https://gerrit.wikimedia.org/r/521419 (https://phabricator.wikimedia.org/T224723)
[06:46:42] (PS2) Elukey: aptrepo: remove packages from amd-rocm's whitelist [puppet] - https://gerrit.wikimedia.org/r/521419 (https://phabricator.wikimedia.org/T224723)
[06:49:42] (PS3) Elukey: aptrepo: remove packages from amd-rocm's whitelist [puppet] - https://gerrit.wikimedia.org/r/521419 (https://phabricator.wikimedia.org/T224723)
[06:50:18] (CR) Elukey: [C: +2] aptrepo: remove packages from amd-rocm's whitelist [puppet] - https://gerrit.wikimedia.org/r/521419 (https://phabricator.wikimedia.org/T224723) (owner: Elukey)
[06:52:40] (PS2) Muehlenhoff: apache::mod_conf: Remove support for Ubuntu [puppet] - https://gerrit.wikimedia.org/r/520778
[06:53:07] Operations, Analytics, Patch-For-Review, User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (elukey) New list: ` root@install1002:/srv/wikimedia# reprepro --noskipold --component thirdparty/amd-rocm checkupdate buster-wikimedia Calculating packa...
[06:54:41] PROBLEM - MariaDB Slave Lag: pc1 on pc2010 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 17380.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [06:57:37] RECOVERY - puppet last run on cp5011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:25] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:11] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:04:18] (03CR) 10Ema: [C: 03+1] Add ncredir-lb records [dns] - 10https://gerrit.wikimedia.org/r/521414 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [07:08:35] (03PS1) 10Vgutierrez: acme_chief: Avoid retrying too eagerly on CERTIFICATE_STAGED status [software/acme-chief] - 10https://gerrit.wikimedia.org/r/521421 (https://phabricator.wikimedia.org/T225945) [07:10:50] (03CR) 10Muehlenhoff: [C: 03+2] apache::mod_conf: Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/520778 (owner: 10Muehlenhoff) [07:15:18] (03PS2) 10Muehlenhoff: kmod::blacklist: Cleanup initramfs trigger [puppet] - 10https://gerrit.wikimedia.org/r/520774 [07:16:48] (03PS2) 10Elukey: profile::base: exclude fuse.fuse_dfs from disk space checks [puppet] - 10https://gerrit.wikimedia.org/r/521272 (https://phabricator.wikimedia.org/T226698) [07:17:51] (03CR) 10Elukey: [C: 03+2] profile::base: exclude fuse.fuse_dfs from disk space checks [puppet] - 10https://gerrit.wikimedia.org/r/521272 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [07:20:24] s/win 22 [07:20:26] uff [07:20:59] RECOVERY - Disk space on an-tool1006 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [07:25:11] \o/ [07:25:58] (03PS3) 10Muehlenhoff: kmod::blacklist: Cleanup initramfs trigger [puppet] - 10https://gerrit.wikimedia.org/r/520774 [07:26:08] !log upload prometheus-mcrouter-exporter 0.0.0+git20190709-1 to stretch-wikimedia - T225059 [07:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:14] T225059: Consider adding per-shard metrics to the prometheus mcrouter exporter - https://phabricator.wikimedia.org/T225059 [07:27:14] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/521421 (https://phabricator.wikimedia.org/T225945) (owner: 10Vgutierrez) [07:28:08] (03CR) 10Muehlenhoff: [C: 03+2] kmod::blacklist: Cleanup initramfs trigger [puppet] - 10https://gerrit.wikimedia.org/r/520774 (owner: 10Muehlenhoff) [07:29:37] (03CR) 10Marostegui: "how did it work yesterday with x1 codfw then?" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521418 (owner: 10Jcrespo) [07:30:17] 10Operations, 10DBA, 10OTRS, 10Operations-Software-Development, 10Recommendation-API: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (10Marostegui) 05Open→03Resolved a:03Marostegui [07:30:23] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [07:30:39] (03PS1) 10Jcrespo: switchover.py: Wait a few seconds after gtid disable to catch up [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521423 [07:33:01] (03PS6) 10Jcrespo: mariadb: Prepare core for buster [puppet] - 10https://gerrit.wikimedia.org/r/519073 (https://phabricator.wikimedia.org/T193224) [07:35:02] (03CR) 10Marostegui: [C: 03+1] switchover.py: Wait a few seconds after gtid disable to catch up [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521423 (owner: 10Jcrespo) [07:35:44] (03CR) 10Jcrespo: [C: 03+2] mariadb: Prepare core for buster [puppet] - 
10https://gerrit.wikimedia.org/r/519073 (https://phabricator.wikimedia.org/T193224) (owner: 10Jcrespo) [07:36:14] 10Operations, 10Deployments, 10Release: OSError: [Errno 1] Operation not permitted when running git fat pull - https://phabricator.wikimedia.org/T208259 (10Gehel) @thcipriani Thanks a lot for the detailed explanation! As this step is not really required on the deploy host, I'll change my notes to validate t... [07:39:37] !log pruning unused libzmq3/python-zmq packages from swift/parsoid hosts [07:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:00] (03PS2) 10Ema: varnish: stop sending the Via response header [puppet] - 10https://gerrit.wikimedia.org/r/521261 (https://phabricator.wikimedia.org/T194814) [07:53:52] (03CR) 10Jcrespo: "> Patch Set 4:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521418 (owner: 10Jcrespo) [07:54:01] (03CR) 10Ema: [C: 03+2] varnish: stop sending the Via response header [puppet] - 10https://gerrit.wikimedia.org/r/521261 (https://phabricator.wikimedia.org/T194814) (owner: 10Ema) [07:56:09] (03CR) 10Marostegui: [C: 03+1] switchover.py: Fix interpolation when updating tendril [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521418 (owner: 10Jcrespo) [07:57:26] (03CR) 10Marostegui: [C: 03+1] "Let's continue the discussion about the --force at some point" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521226 (owner: 10Jcrespo) [07:59:38] (03CR) 10Jcrespo: [C: 03+2] switchover.py: Check binary log format before switch [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521226 (owner: 10Jcrespo) [07:59:53] (03PS5) 10Jcrespo: switchover.py: Fix interpolation when updating tendril [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521418 [08:00:16] !log Upgrade db1065 to 10.1.39 [08:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:40] (03CR) 10Jcrespo: [C: 03+2] switchover.py: Fix interpolation when updating 
tendril [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521418 (owner: 10Jcrespo) [08:01:02] (03Merged) 10jenkins-bot: switchover.py: Fix interpolation when updating tendril [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521418 (owner: 10Jcrespo) [08:01:20] (03PS2) 10Jcrespo: switchover.py: Wait a few seconds after gtid disable to catch up [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521423 [08:01:44] (03CR) 10Jcrespo: [C: 03+2] switchover.py: Wait a few seconds after gtid disable to catch up [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521423 (owner: 10Jcrespo) [08:02:08] (03Merged) 10jenkins-bot: switchover.py: Wait a few seconds after gtid disable to catch up [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521423 (owner: 10Jcrespo) [08:06:37] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:07:09] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:07:47] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [08:08:20] !log installing zeromq3 security updates [08:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:25] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above 
the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:09:24] big spike that seems to have already recovered (from the 50x dashboard) [08:10:47] elukey: speaking of which :) [08:11:19] the traffic guy! :P [08:11:22] ciao ema [08:11:33] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:11:45] elukey: I've just sent an email to ops@, how do you usually go about debugging MW issues? (grafana dashboards and so forth) [08:11:54] Is it just me or does the linked graph have no data points? [08:12:00] 10Operations, 10serviceops, 10Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10hashar) >>! In T226236#5296968, @MoritzMuehlenhoff wrote: > Well, if there's a security update for Docker we'll want it... [08:12:15] volans: nono I confirm [08:12:22] volans: which linked graph? [08:12:31] the above one from icinga alert [08:12:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:12:57] the myRmf1Pik/varnish-aggregate-client-status-codes [08:12:59] one [08:13:27] volans: ah, yeah. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 is fine [08:13:44] volans: https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 is not [08:13:56] that's because the dashboard has been updated recently [08:13:57] fixing [08:14:01] thx [08:15:18] we should probably have a daily check on all grafana dashboards marked in some way (the "official" ones) to check that they have data points [08:15:43] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:16:33] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [08:16:56] (03PS1) 10Muehlenhoff: Add Cumin alias for acmechief [puppet] - 10https://gerrit.wikimedia.org/r/521425 [08:18:18] ema: I am no expert in debugging these issues on mw appservers; usually it is easier if we can pinpoint a specific app/api server and check on it (apache logs etc.). In logstash we have the fatal monitor (that should log mw exceptions, fatals etc.) but it is a bit noisy [08:18:47] ema: do you have a dashboard for varnish backends misbehaving? [08:19:00] elukey: do you know if there's a dashboard like https://grafana.wikimedia.org/d/0fj55kRZz/thumbor?panelId=43&fullscreen&orgId=1 but for mediawiki? 
[08:19:02] (misbehaving == reporting failures from backend etc.., not blaming Varnish :) [08:19:30] elukey: yeah, Varnish Fetch Errors [08:20:30] ema: so we have https://grafana.wikimedia.org/d/000000327/apache-hhvm?orgId=1, but it doesn't contain HTTP return codes.. in theory we'd need something that parses httpd's access logs for that, and I think we don't have anything similar [08:20:45] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - /mnt/docker/overlay2/f5724cda69fb24bcfe9776790633ebd77afef759c203c7530a560f548df7d96f/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [08:22:15] ema: checked Varnish Fetch Errors and it is like decrypting a TLS connection with tcpdump for me :D [08:22:53] (03CR) 10Vgutierrez: [C: 03+1] Add Cumin alias for acmechief [puppet] - 10https://gerrit.wikimedia.org/r/521425 (owner: 10Muehlenhoff) [08:24:25] elukey: TLS is not that hard... [08:24:26] ;P [08:24:49] ahahahh yes it is probably me then [08:25:09] RECOVERY - Disk space on contint1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [08:25:10] (03CR) 10Hashar: [C: 03+1] Remove support for Ubuntu from os_version and related tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520765 (owner: 10Muehlenhoff) [08:26:27] (03PS1) 10Ema: monitoring: update icinga links to varnish-aggregate-client-status-codes [puppet] - 10https://gerrit.wikimedia.org/r/521427 (https://phabricator.wikimedia.org/T184942) [08:26:31] volans et al: ^ [08:27:28] (03CR) 10Hashar: [C: 03+1] Remove support for Ubuntu from os_version and related tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520765 (owner: 10Muehlenhoff) [08:27:50] (03CR) 10Elukey: [C: 03+1] monitoring: update icinga links to varnish-aggregate-client-status-codes [puppet] - 10https://gerrit.wikimedia.org/r/521427 (https://phabricator.wikimedia.org/T184942) (owner: 10Ema) [08:29:26] (03PS2) 10Matthias Mullie: Configure help urls 
for MediaInfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521298 (https://phabricator.wikimedia.org/T227226) (owner: 10Cparle) [08:30:51] * ema updated his xterm*on2Clicks regex in .Xresources to include % and feels much better [08:32:15] (03CR) 10Matthias Mullie: [C: 03+1] "LGTM. Scheduled for SWAT later today" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521298 (https://phabricator.wikimedia.org/T227226) (owner: 10Cparle) [08:32:19] elukey: I've added a few things to https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts, does that seem reasonable or does it still feel like a rosetta stone is needed for decryption? [08:33:01] (03PS1) 10Filippo Giunchedi: hieradata: enable centrallog1001 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/521428 (https://phabricator.wikimedia.org/T200706) [08:33:18] (03PS2) 10Muehlenhoff: Add Cumin alias for acmechief [puppet] - 10https://gerrit.wikimedia.org/r/521425 [08:33:32] ema: looks good thanks! [08:33:47] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Avoid retrying too eagerly on CERTIFICATE_STAGED status [software/acme-chief] - 10https://gerrit.wikimedia.org/r/521421 (https://phabricator.wikimedia.org/T225945) (owner: 10Vgutierrez) [08:34:27] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for acmechief [puppet] - 10https://gerrit.wikimedia.org/r/521425 (owner: 10Muehlenhoff) [08:35:27] (03PS1) 10Muehlenhoff: Fix traceback when running query_restart on a non-library [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/521430 [08:36:04] !log upgrade prometheus-mcrouter-exporter to 0.0.0+git20190709-1 on mw-codfw (cumin alias) via debdeploy - T225059 [08:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:09] T225059: Consider adding per-shard metrics to the prometheus mcrouter exporter - https://phabricator.wikimedia.org/T225059 [08:37:04] (03CR) 10jenkins-bot: acme_chief: Avoid retrying too eagerly on CERTIFICATE_STAGED status [software/acme-chief] - 
10https://gerrit.wikimedia.org/r/521421 (https://phabricator.wikimedia.org/T225945) (owner: 10Vgutierrez) [08:38:18] moritzm: I was wondering if we should add aliases for the distros (jessie, stretch, buster) [08:38:51] (03PS1) 10Marostegui: db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521432 [08:39:22] good idea, I currently do 'F:lsbdistcodename = foo' manually, but adding an alias sounds good [08:39:45] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521432 (owner: 10Marostegui) [08:40:19] (03PS2) 10Ema: monitoring: update icinga links to varnish-aggregate-client-status-codes [puppet] - 10https://gerrit.wikimedia.org/r/521427 (https://phabricator.wikimedia.org/T184942) [08:40:43] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521432 (owner: 10Marostegui) [08:40:58] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521432 (owner: 10Marostegui) [08:41:23] (03CR) 10Ema: [C: 03+2] monitoring: update icinga links to varnish-aggregate-client-status-codes [puppet] - 10https://gerrit.wikimedia.org/r/521427 (https://phabricator.wikimedia.org/T184942) (owner: 10Ema) [08:41:53] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1086 for upgrade (duration: 00m 51s) [08:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:57] !log Upgrade db1086 [08:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:23] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [08:49:24] !log upgrade prometheus-mcrouter-exporter to 0.0.0+git20190709-1 on mw-eqiad (cumin alias) via debdeploy - T225059 [08:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:29] T225059: Consider adding per-shard 
metrics to the prometheus mcrouter exporter - https://phabricator.wikimedia.org/T225059 [08:50:22] 10Operations, 10Traffic: ATS is currently adding its own server header - https://phabricator.wikimedia.org/T224119 (10ema) 05Open→03Resolved a:03ema ATS now sets `Server` only if missing in the origin server response. Also, Varnish now does not send `Via` any longer (the header wasn't used at all). [08:50:24] (03PS2) 10Muehlenhoff: hieradata: enable centrallog1001 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/521428 (https://phabricator.wikimedia.org/T200706) (owner: 10Filippo Giunchedi) [08:50:38] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/521428 (https://phabricator.wikimedia.org/T200706) (owner: 10Filippo Giunchedi) [08:51:20] (03PS1) 10Marostegui: db2038: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/521435 (https://phabricator.wikimedia.org/T227565) [08:51:39] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: enable centrallog1001 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/521428 (https://phabricator.wikimedia.org/T200706) (owner: 10Filippo Giunchedi) [08:52:44] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521436 [08:52:45] (03PS2) 10Marostegui: db2038: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/521435 (https://phabricator.wikimedia.org/T227565) [08:53:51] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521436 (owner: 10Marostegui) [08:53:54] (03CR) 10Marostegui: [C: 03+2] db2038: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/521435 (https://phabricator.wikimedia.org/T227565) (owner: 10Marostegui) [08:54:43] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521436 (owner: 10Marostegui) [08:56:12] (03PS1) 10Muehlenhoff: Extend Cumin alias for 
hadoop-testcluster [puppet] - 10https://gerrit.wikimedia.org/r/521438 [08:56:22] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521436 (owner: 10Marostegui) [08:56:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1086 after upgrade (duration: 00m 47s) [08:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:49] (03CR) 10Hashar: "It works perfectly now! I have updated the wiki doc at https://www.mediawiki.org/w/index.php?title=Continuous_integration/Zuul&diff=330729" [puppet] - 10https://gerrit.wikimedia.org/r/521315 (owner: 10Hashar) [08:59:29] (03PS3) 10Arturo Borrero Gonzalez: bootstrap-vz: configure base image to use sssd for buster and stretch [puppet] - 10https://gerrit.wikimedia.org/r/521278 (https://phabricator.wikimedia.org/T227475) (owner: 10Andrew Bogott) [09:02:20] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Tarrow) [09:03:57] 10Operations, 10vm-requests: Site: eqiad/codfw VM for ORES pool counters - https://phabricator.wikimedia.org/T227567 (10MoritzMuehlenhoff) [09:04:11] 10Operations, 10vm-requests: Site: eqiad/codfw VM for ORES pool counters - https://phabricator.wikimedia.org/T227567 (10MoritzMuehlenhoff) p:05Triage→03Normal a:03MoritzMuehlenhoff [09:07:23] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521439 [09:08:20] (03PS1) 10Ema: cache: consolidate NVMe settings in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/521440 (https://phabricator.wikimedia.org/T226638) [09:08:53] (03PS1) 10Fsero: helmfile,k8s: creating hfenv variables [puppet] - 10https://gerrit.wikimedia.org/r/521441 (https://phabricator.wikimedia.org/T212130) [09:09:07] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1086 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/521439 (owner: 10Marostegui) [09:10:09] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521439 (owner: 10Marostegui) [09:10:45] (03PS1) 10Elukey: profile::prometheus::mcrouter_exporter: enable per-server metrics [puppet] - 10https://gerrit.wikimedia.org/r/521442 (https://phabricator.wikimedia.org/T225059) [09:11:07] (03CR) 10Elukey: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/521438 (owner: 10Muehlenhoff) [09:11:09] (03PS1) 10Arturo Borrero Gonzalez: cloud: sssd: use by default for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/521443 (https://phabricator.wikimedia.org/T227475) [09:11:13] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1086 after upgrade (duration: 00m 49s) [09:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:40] (03CR) 10Elukey: [C: 03+2] profile::prometheus::mcrouter_exporter: enable per-server metrics [puppet] - 10https://gerrit.wikimedia.org/r/521442 (https://phabricator.wikimedia.org/T225059) (owner: 10Elukey) [09:11:44] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521439 (owner: 10Marostegui) [09:12:41] (03PS2) 10Ema: cache: consolidate NVMe settings in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/521440 (https://phabricator.wikimedia.org/T226638) [09:12:59] (03PS1) 10Volans: cumin aliases: add distribution aliases [puppet] - 10https://gerrit.wikimedia.org/r/521444 [09:13:03] (03PS2) 10Fsero: helmfile,k8s: creating hfenv variables [puppet] - 10https://gerrit.wikimedia.org/r/521441 (https://phabricator.wikimedia.org/T212130) [09:13:05] !log enable per-server metrics on all prometheus-mcrouter-exporter(s) via puppet - T225059 [09:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:11] T225059: Consider adding per-shard metrics to the 
prometheus mcrouter exporter - https://phabricator.wikimedia.org/T225059 [09:13:22] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [09:15:15] (03PS2) 10Muehlenhoff: Extend Cumin alias for hadoop-testcluster [puppet] - 10https://gerrit.wikimedia.org/r/521438 [09:15:35] (03CR) 10Volans: "Compiler result here:" [puppet] - 10https://gerrit.wikimedia.org/r/521444 (owner: 10Volans) [09:15:41] moritzm: ^^^ :) [09:15:44] (03PS3) 10Fsero: helmfile,k8s: creating hfenv variables [puppet] - 10https://gerrit.wikimedia.org/r/521441 (https://phabricator.wikimedia.org/T212130) [09:17:15] (03CR) 10Muehlenhoff: [C: 03+2] Extend Cumin alias for hadoop-testcluster [puppet] - 10https://gerrit.wikimedia.org/r/521438 (owner: 10Muehlenhoff) [09:17:21] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler1001/17262/" [puppet] - 10https://gerrit.wikimedia.org/r/521440 (https://phabricator.wikimedia.org/T226638) (owner: 10Ema) [09:17:32] (03PS3) 10Ema: cache: consolidate NVMe settings in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/521440 (https://phabricator.wikimedia.org/T226638) [09:18:02] (03CR) 10Volans: [C: 03+1] "LGTM" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/521430 (owner: 10Muehlenhoff) [09:18:46] (03CR) 10Ema: [C: 03+2] cache: consolidate NVMe settings in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/521440 (https://phabricator.wikimedia.org/T226638) (owner: 10Ema) [09:19:49] (03PS4) 10Fsero: helmfile,k8s: creating hfenv variables [puppet] - 10https://gerrit.wikimedia.org/r/521441 (https://phabricator.wikimedia.org/T212130) [09:25:02] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1076.eqiad.wmnet,service=ats-be [09:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:02] (03PS5) 10Fsero: helmfile,k8s: creating hfenv variables [puppet] - 
10https://gerrit.wikimedia.org/r/521441 (https://phabricator.wikimedia.org/T212130) [09:26:24] !log cp1076: restart trafficserver with storage.config set to /dev/nvme0n1 [09:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:24] (03PS3) 10Cparle: Configure help urls for MediaInfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521298 (https://phabricator.wikimedia.org/T227226) [09:28:11] (03CR) 10Fsero: "PCC happy https://puppet-compiler.wmflabs.org/compiler1001/17265/deploy1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/521441 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [09:30:00] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1076.eqiad.wmnet,service=ats-be [09:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:40] (03CR) 10Jbond: [C: 03+1] "LGTM, i would even vote to make it the default for `nrpe::monitor_service`" [puppet] - 10https://gerrit.wikimedia.org/r/521376 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [09:40:00] (03PS3) 10Muehlenhoff: Remove support for Ubuntu from os_version and related tests [puppet] - 10https://gerrit.wikimedia.org/r/520765 [09:40:09] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:40:18] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/520765 (owner: 10Muehlenhoff) [09:41:04] (03PS1) 10Tarrow: Introduce termbox-test LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/521449 (https://phabricator.wikimedia.org/T226814) [09:41:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/521444 (owner: 10Volans) [09:41:40] (03CR) 10Muehlenhoff: [C: 03+2] Remove support for Ubuntu from os_version and related tests [puppet] - 10https://gerrit.wikimedia.org/r/520765 (owner: 10Muehlenhoff) [09:42:06] (03PS2) 10Volans: cumin aliases: add distribution aliases [puppet] - 10https://gerrit.wikimedia.org/r/521444 [09:42:31] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/521444 (owner: 10Volans) [09:43:35] (03CR) 10Volans: [C: 03+2] cumin aliases: add distribution aliases [puppet] - 10https://gerrit.wikimedia.org/r/521444 (owner: 10Volans) [09:44:23] (03CR) 10Jbond: [C: 03+1] "LGTM" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/521430 (owner: 10Muehlenhoff) [09:44:35] moritzm: I've your patch listed in puppet-merge too :) [09:44:40] sorry I was too quick :D [09:45:20] ok to merge? or feel free to merge mine when you're ready for yours [09:45:56] ack [09:46:03] please merge along [09:46:10] ack [09:47:21] {done} [09:47:36] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Release-Engineering-Team-TODO (201907): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10hashar) 05Open→03Resolved Perfect thank you @thcipriani , those tests were exactly... [09:47:42] 10Operations, 10Continuous-Integration-Infrastructure, 10Release Pipeline, 10Release-Engineering-Team-TODO (201907): Switch CI Docker Storage Driver to its own partition and to use devicemapper - https://phabricator.wikimedia.org/T178663 (10hashar) [09:54:05] 10Operations, 10Analytics, 10Patch-For-Review, 10User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (10elukey) Now the annoying part: ` elukey@stat1005:~$ apt-cache show hsa-rocr-dev Package: hsa-rocr-dev Status: install ok installed Priority: optional Se... [09:54:32] is it just me or is gerrit really really slow? 
[09:55:20] it's really slow for me as well [09:55:44] it works fine for me [09:56:02] strange [09:56:16] I can't load any patches [09:56:30] can you paste an URL that doesn't work for you? [09:56:43] (03PS1) 10Muehlenhoff: Drop obsolete osm spec test [puppet] - 10https://gerrit.wikimedia.org/r/521451 [09:56:45] (03PS6) 10Fsero: helmfile,k8s: creating hfenv variables [puppet] - 10https://gerrit.wikimedia.org/r/521441 (https://phabricator.wikimedia.org/T212130) [09:57:00] marostegui: any, but this for example https://gerrit.wikimedia.org/r/460552 [09:57:04] (from mail) [09:57:11] works for me [09:57:28] well, I'm using new UI, maybe something's wrong there? [09:57:45] Ah, I am with the old one [09:58:04] fast.com says my internets are fine... [09:58:31] zeljkof: I'm also using the old UI and things work. Try going back to the past? :) [09:59:04] trying :) [09:59:31] well, when I click the link in the footer, I get `Code Review - Error Plugin "reviewers" failed to load` [10:00:06] and `Code Review - Error Cannot load plugin from plugins/reviewers/static/reviewers/reviewers.nocache.js` [10:00:53] I've an error in console, but the interaction is fine time-wise [10:01:05] Uncaught (in promise) TypeError: t.reduce is not a function (in main.js) [10:01:08] (03PS1) 10Tarrow: termbox: add Kubernetes stanzas for test [puppet] - 10https://gerrit.wikimedia.org/r/521452 (https://phabricator.wikimedia.org/T226814) [10:01:40] I'm finding running `git review` is crazy slow as well [10:01:52] https://usercontent.irccloud-cdn.com/file/H4HtCT5C/gerrit.png [10:01:54] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:01:55] (03PS1) 10Elukey: package_builder: install the equivs package [puppet] - 10https://gerrit.wikimedia.org/r/521453 [10:02:10] 
I am using the new ui [10:02:11] also from the same console main.js:1 GET https://gerrit.wikimedia.org/r/changes/?pp=0&o=TRACKING_IDS&o=DETAILED_LABELS&q=bug: 400 (Bad Request) [10:02:17] but apart from that all is fine [10:02:20] but things do seem to work in the old UI [10:02:22] but it is not as if gerrit is unknown to fail :-) [10:02:31] (03PS1) 10Muehlenhoff: No need to remove eject any longer [puppet] - 10https://gerrit.wikimedia.org/r/521454 [10:02:40] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:02:43] (03PS1) 10Vgutierrez: ncredir: Provide a /_status endpoint for LVS monitoring purposes [puppet] - 10https://gerrit.wikimedia.org/r/521455 (https://phabricator.wikimedia.org/T133548) [10:02:55] try logging out, removing cookies and logging in [10:02:59] jynus: new UI works for you? [10:03:02] slow is a subjective term, zeljkof; could you maybe run an mtr to gerrit.wikimedia.org and take a screenshot of the developer console in your browser? [10:03:12] jynus: will do [10:03:12] that would give data :) [10:03:14] zeljkof: for what I do, yes [10:04:36] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [10:04:49] maybe I am using it wrong and that is why it works for me [10:04:52] fsero: "slow" as in the link from mail never opens :) [10:05:09] I've just clicked a link in mail and it opens a blank page :/ [10:05:23] fsero: never used mtr, but I can try [10:05:27] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. 
status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [10:06:30] (03CR) 10Ema: [C: 03+1] ncredir: Provide a /_status endpoint for LVS monitoring purposes [puppet] - 10https://gerrit.wikimedia.org/r/521455 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [10:06:58] (03CR) 10Vgutierrez: [C: 03+2] ncredir: Provide a /_status endpoint for LVS monitoring purposes [puppet] - 10https://gerrit.wikimedia.org/r/521455 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [10:07:07] (03PS2) 10Vgutierrez: ncredir: Provide a /_status endpoint for LVS monitoring purposes [puppet] - 10https://gerrit.wikimedia.org/r/521455 (https://phabricator.wikimedia.org/T133548) [10:09:09] (03PS1) 10Tarrow: Assign termbox-test.svc.{eqiad,codfw}.wmnet LVS IPs [dns] - 10https://gerrit.wikimedia.org/r/521456 (https://phabricator.wikimedia.org/T226814) [10:09:34] (03CR) 10jerkins-bot: [V: 04-1] Assign termbox-test.svc.{eqiad,codfw}.wmnet LVS IPs [dns] - 10https://gerrit.wikimedia.org/r/521456 (https://phabricator.wikimedia.org/T226814) (owner: 10Tarrow) [10:09:42] (03PS15) 10Jcrespo: prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) [10:12:19] (03CR) 10Muehlenhoff: [C: 03+1] package_builder: install the equivs package [puppet] - 10https://gerrit.wikimedia.org/r/521453 (owner: 10Elukey) [10:12:53] (03PS16) 10Jcrespo: prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) [10:12:55] (03CR) 10Ema: [C: 03+1] package_builder: install the equivs package [puppet] - 10https://gerrit.wikimedia.org/r/521453 (owner: 10Elukey) [10:13:20] (03CR) 10Elukey: [C: 03+2] package_builder: install the equivs package [puppet] - 10https://gerrit.wikimedia.org/r/521453 (owner: 10Elukey) [10:13:28] (03PS2) 10Elukey: package_builder: install the equivs package [puppet] 
- 10https://gerrit.wikimedia.org/r/521453 [10:13:50] (03PS1) 10Tarrow: typo fix: termbox codfw should have codfw IP [dns] - 10https://gerrit.wikimedia.org/r/521457 [10:14:23] !log upgrade openssl on canary systems [10:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:41] (03CR) 10Tarrow: "This change is ready for review." [dns] - 10https://gerrit.wikimedia.org/r/521457 (owner: 10Tarrow) [10:23:40] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:24:15] (03PS1) 10Muehlenhoff: Remove obsolete comments [puppet] - 10https://gerrit.wikimedia.org/r/521458 [10:24:28] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:26:44] (03CR) 10Muehlenhoff: [C: 03+2] Fix traceback when running query_restart on a non-library [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/521430 (owner: 10Muehlenhoff) [10:27:53] (03PS1) 10Tarrow: Enable discovery for termbox-test [dns] - 10https://gerrit.wikimedia.org/r/521459 (https://phabricator.wikimedia.org/T226814) [10:38:40] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Deployment services): Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (10MoritzMuehlenhoff) @Jakob_WMDE: Please generate a separate SSH key for the access to the Wikimedia production cluster (it n... 
[10:39:26] !log update wikimedia-buster thirparty/amd-rocm component with upstream packages - T224723 [10:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:32] T224723: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 [10:40:17] (03PS1) 10Elukey: profile::statistics::gpu: add packages from thirdparty/rocm [puppet] - 10https://gerrit.wikimedia.org/r/521463 (https://phabricator.wikimedia.org/T224723) [10:42:39] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17267/stat1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/521463 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [10:43:55] (03CR) 10Muehlenhoff: profile::statistics::gpu: add packages from thirdparty/rocm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/521463 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [10:45:11] (03CR) 10Alexandros Kosiaris: [C: 04-1] helmfile,k8s: creating hfenv variables (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/521441 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [10:45:56] (03CR) 10Elukey: [C: 03+2] profile::statistics::gpu: add packages from thirdparty/rocm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/521463 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [10:48:02] jouncebot, next [10:48:02] In 0 hour(s) and 11 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190709T1100) [10:51:05] (03PS2) 10Tarrow: Assign termbox-test.svc.{eqiad,codfw}.wmnet LVS IPs [dns] - 10https://gerrit.wikimedia.org/r/521456 (https://phabricator.wikimedia.org/T226814) [10:55:10] (03CR) 10Fsero: helmfile,k8s: creating hfenv variables (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/521441 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did 
functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190709T1100). [11:00:05] kart_ and matthiasmullie: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:16] yep [11:00:22] I can SWAT today! [11:00:55] sure :p [11:01:18] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521298 (https://phabricator.wikimedia.org/T227226) (owner: 10Cparle) [11:01:25] Urbanecm: here [11:01:32] hi kart_ [11:02:09] Urbanecm: doesn't need to go on mwdebug - it's just prep, unused atm [11:02:20] (03CR) 10Tarrow: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/521449 (https://phabricator.wikimedia.org/T226814) (owner: 10Tarrow) [11:02:21] ack matthiasmullie [11:02:25] (03Merged) 10jenkins-bot: Configure help urls for MediaInfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521298 (https://phabricator.wikimedia.org/T227226) (owner: 10Cparle) [11:02:36] (03CR) 10Tarrow: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/521452 (https://phabricator.wikimedia.org/T226814) (owner: 10Tarrow) [11:02:41] (03CR) 10jenkins-bot: Configure help urls for MediaInfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521298 (https://phabricator.wikimedia.org/T227226) (owner: 10Cparle) [11:02:49] (03CR) 10Tarrow: "This change is ready for review." [dns] - 10https://gerrit.wikimedia.org/r/521459 (https://phabricator.wikimedia.org/T226814) (owner: 10Tarrow) [11:03:01] (03CR) 10Tarrow: "This change is ready for review." [dns] - 10https://gerrit.wikimedia.org/r/521456 (https://phabricator.wikimedia.org/T226814) (owner: 10Tarrow) [11:03:11] thanks! 
[11:03:48] (03PS2) 10Urbanecm: Configuration migration for Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517933 (https://phabricator.wikimedia.org/T87985) (owner: 10Awight) [11:03:54] RECOVERY - MariaDB Slave Lag: pc1 on pc2010 is OK: OK slave_sql_lag Replication lag: 35.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [11:03:56] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517933 (https://phabricator.wikimedia.org/T87985) (owner: 10Awight) [11:04:05] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:521298|Configure help urls for MediaInfo]] (T227226) (duration: 00m 50s) [11:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:10] T227226: Only Depicts statement panels should have Learn More links (for now) - https://phabricator.wikimedia.org/T227226 [11:04:11] matthiasmullie, deployed. [11:04:20] kart_, your patch is next [11:04:36] Urbanecm: thanks! [11:04:40] yw [11:04:52] Urbanecm: OK [11:04:58] (03Merged) 10jenkins-bot: Configuration migration for Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517933 (https://phabricator.wikimedia.org/T87985) (owner: 10Awight) [11:05:16] (03CR) 10jenkins-bot: Configuration migration for Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517933 (https://phabricator.wikimedia.org/T87985) (owner: 10Awight) [11:05:25] (03CR) 10Jbond: [C: 03+1] "The initial comment surrounding this code is somewhat confusing as it mentions Ubuntu but the code dosn't. Either way i validated this wi" [puppet] - 10https://gerrit.wikimedia.org/r/521454 (owner: 10Muehlenhoff) [11:05:32] kart_, your patch is on mwdebug1002 [11:06:06] Urbanecm: difficult to test, but let me check nothing breaks. 
[11:06:11] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521390 (https://phabricator.wikimedia.org/T227546) (owner: 10DannyS712) [11:06:13] sure [11:06:38] (03PS4) 10Urbanecm: Clean up `wgNamespacesWithSubpages` to remove unneeded entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521390 (https://phabricator.wikimedia.org/T227546) (owner: 10DannyS712) [11:06:48] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521390 (https://phabricator.wikimedia.org/T227546) (owner: 10DannyS712) [11:07:28] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Tarrow) @akosiaris Happy to say that were are now have code have sending out request metrics form master. In our investigation about going... [11:07:50] (03Merged) 10jenkins-bot: Clean up `wgNamespacesWithSubpages` to remove unneeded entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521390 (https://phabricator.wikimedia.org/T227546) (owner: 10DannyS712) [11:08:19] (03CR) 10jenkins-bot: Clean up `wgNamespacesWithSubpages` to remove unneeded entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521390 (https://phabricator.wikimedia.org/T227546) (owner: 10DannyS712) [11:08:26] Urbanecm: go ahead. 
[11:08:30] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521383 (https://phabricator.wikimedia.org/T227000) (owner: 10DannyS712) [11:08:33] ok kart_, syncing [11:09:21] (03Merged) 10jenkins-bot: Disable flaggedrevs for hewikisource main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521383 (https://phabricator.wikimedia.org/T227000) (owner: 10DannyS712) [11:09:46] (03PS1) 10Muehlenhoff: Add DNS entries for orespoolcounter[12]00[34] [dns] - 10https://gerrit.wikimedia.org/r/521470 (https://phabricator.wikimedia.org/T227567) [11:09:51] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[:gerrit:517933|Configuration migration for Translate]] (T87985) (duration: 00m 49s) [11:09:56] kart_, synced [11:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:57] T87985: Convert Translate to use extension registration - https://phabricator.wikimedia.org/T87985 [11:10:03] Urbanecm: Thanks! 
[11:10:06] yw [11:10:47] (03CR) 10jenkins-bot: Disable flaggedrevs for hewikisource main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521383 (https://phabricator.wikimedia.org/T227000) (owner: 10DannyS712) [11:11:20] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:521390|Clean up `wgNamespacesWithSubpages` to remove unneeded entries]] (T227546) (duration: 00m 49s) [11:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:26] T227546: Clean up `wgNamespacesWithSubpages` to remove unneeded entries - https://phabricator.wikimedia.org/T227546 [11:12:39] !log urbanecm@deploy1001 Synchronized wmf-config/flaggedrevs.php: SWAT: [[:gerrit:521383|Disable flaggedrevs for hewikisource main page]] (T227000) (duration: 00m 48s) [11:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:44] T227000: Configuration request for Flagged Reviews at Hebrew Wikisource - https://phabricator.wikimedia.org/T227000 [11:13:04] !log EU SWAT done [11:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:49] (03CR) 10Alexandros Kosiaris: [C: 03+2] Drop obsolete osm spec test [puppet] - 10https://gerrit.wikimedia.org/r/521451 (owner: 10Muehlenhoff) [11:17:56] (03PS2) 10Alexandros Kosiaris: Drop obsolete osm spec test [puppet] - 10https://gerrit.wikimedia.org/r/521451 (owner: 10Muehlenhoff) [11:17:59] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Drop obsolete osm spec test [puppet] - 10https://gerrit.wikimedia.org/r/521451 (owner: 10Muehlenhoff) [11:20:41] (03CR) 10Alexandros Kosiaris: [C: 04-1] helmfile,k8s: creating hfenv variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/521441 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [11:21:52] (03PS1) 10Vgutierrez: ncredir: Move last resource return to a location block [puppet] - 10https://gerrit.wikimedia.org/r/521473 (https://phabricator.wikimedia.org/T133548) 
[11:22:11] (03CR) 10jerkins-bot: [V: 04-1] ncredir: Move last resource return to a location block [puppet] - 10https://gerrit.wikimedia.org/r/521473 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [11:22:28] (03CR) 10Alexandros Kosiaris: [C: 03+1] uwsgi::app: add notes_url for services using uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/521376 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [11:24:17] (03PS1) 10Elukey: aptrepo: add more packages to the amd-rocm's whitelist [puppet] - 10https://gerrit.wikimedia.org/r/521475 (https://phabricator.wikimedia.org/T224723) [11:24:19] (03PS1) 10Elukey: profile::statistics::gpu: fix packages installed [puppet] - 10https://gerrit.wikimedia.org/r/521476 (https://phabricator.wikimedia.org/T224723) [11:25:29] (03CR) 10Elukey: [C: 03+2] aptrepo: add more packages to the amd-rocm's whitelist [puppet] - 10https://gerrit.wikimedia.org/r/521475 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [11:26:23] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:26:24] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:41] (03PS2) 10Elukey: profile::statistics::gpu: fix packages installed [puppet] - 10https://gerrit.wikimedia.org/r/521476 (https://phabricator.wikimedia.org/T224723) [11:27:37] (03CR) 10Elukey: [C: 03+2] profile::statistics::gpu: fix packages installed [puppet] - 10https://gerrit.wikimedia.org/r/521476 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [11:29:56] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:29:58] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:46] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:30:47] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:07] (03PS2) 10Vgutierrez: ncredir: Move last resource return to a location block [puppet] - 10https://gerrit.wikimedia.org/r/521473 (https://phabricator.wikimedia.org/T133548) [11:35:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] typo fix: termbox codfw should have codfw IP [dns] - 10https://gerrit.wikimedia.org/r/521457 (owner: 10Tarrow) [11:35:43] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks, missed that." [dns] - 10https://gerrit.wikimedia.org/r/521457 (owner: 10Tarrow) [11:37:20] (03CR) 10Alexandros Kosiaris: nrpe: add notes_url parameter to spec and tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/521386 (owner: 10Dzahn) [11:39:12] (03PS1) 10Elukey: profile::statistics::gpu: add rocm-smi to the packages required [puppet] - 10https://gerrit.wikimedia.org/r/521478 (https://phabricator.wikimedia.org/T224723) [11:40:19] (03CR) 10Elukey: [C: 03+2] profile::statistics::gpu: add rocm-smi to the packages required [puppet] - 10https://gerrit.wikimedia.org/r/521478 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [11:40:28] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [11:41:56] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [11:42:49] (03PS2) 10Muehlenhoff: Add DNS entries for orespoolcounter[12]00[34] [dns] - 10https://gerrit.wikimedia.org/r/521470 
(https://phabricator.wikimedia.org/T227567) [11:45:29] (03CR) 10Muehlenhoff: [C: 03+2] Add DNS entries for orespoolcounter[12]00[34] [dns] - 10https://gerrit.wikimedia.org/r/521470 (https://phabricator.wikimedia.org/T227567) (owner: 10Muehlenhoff) [11:46:59] (03CR) 10Ema: [C: 03+1] ncredir: Move last resource return to a location block [puppet] - 10https://gerrit.wikimedia.org/r/521473 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [11:47:40] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:47:41] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:57] (03CR) 10Ema: [C: 03+1] No need to remove eject any longer [puppet] - 10https://gerrit.wikimedia.org/r/521454 (owner: 10Muehlenhoff) [11:51:31] (03PS1) 10Elukey: Enable base::firewall on stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/521479 (https://phabricator.wikimedia.org/T170826) [11:54:16] (03CR) 10Muehlenhoff: "This only configures the spark port, but doesn't enable base::firewall?" 
[puppet] - 10https://gerrit.wikimedia.org/r/521479 (https://phabricator.wikimedia.org/T170826) (owner: 10Elukey) [11:54:56] (03CR) 10Alexandros Kosiaris: "I 'd be good to have a task for this, plus some in-person verification" [puppet] - 10https://gerrit.wikimedia.org/r/519941 (owner: 10Aaron Schulz) [11:55:48] moritzm: ahahahh I think that I need to have lunch and come back :D [11:56:49] (03PS2) 10Elukey: Enable base::firewall on stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/521479 (https://phabricator.wikimedia.org/T170826) [11:56:55] done, sorry :) [11:57:30] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [11:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:43] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, the NRPE/rsync/SSH/Exim/Prometheus ports will be handled by the common Ferm rules and the rpcbind-related ports will be correc" [puppet] - 10https://gerrit.wikimedia.org/r/521479 (https://phabricator.wikimedia.org/T170826) (owner: 10Elukey) [12:02:41] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [12:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:44] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [12:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:34] (03CR) 10Hashar: [C: 04-1] "I believe there is some back compatibility that needs to be added." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [12:08:56] (03PS1) 10Ema: cache: refresh VTC tests after cache_upload conversion [puppet] - 10https://gerrit.wikimedia.org/r/521481 (https://phabricator.wikimedia.org/T226589) [12:09:49] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [12:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:34] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [12:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:45] !log jmm@cumin2001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [12:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:01] (03CR) 10Hashar: [C: 03+1] Add git buildpackage configuration [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/519363 (owner: 10Hashar) [12:13:48] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm [12:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:15] (03CR) 10Ema: [C: 03+2] cache: refresh VTC tests after cache_upload conversion [puppet] - 10https://gerrit.wikimedia.org/r/521481 (https://phabricator.wikimedia.org/T226589) (owner: 10Ema) [12:18:51] !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [12:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:46] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm [12:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:08] (03PS2) 10Muehlenhoff: No need to remove eject any longer [puppet] - 10https://gerrit.wikimedia.org/r/521454 [12:26:21] (03CR) 10Muehlenhoff: [C: 03+2] No need to remove eject any longer [puppet] - 10https://gerrit.wikimedia.org/r/521454 (owner: 10Muehlenhoff) [12:26:48] !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) 
[12:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:56] (03PS1) 10Jbond: admin: new user group secteam-users [puppet] - 10https://gerrit.wikimedia.org/r/521483 (https://phabricator.wikimedia.org/T223463) [12:28:58] (03PS1) 10Jbond: admin: new group add secteam-admin [puppet] - 10https://gerrit.wikimedia.org/r/521484 (https://phabricator.wikimedia.org/T223463) [12:29:39] (03CR) 10jerkins-bot: [V: 04-1] admin: new user group secteam-users [puppet] - 10https://gerrit.wikimedia.org/r/521483 (https://phabricator.wikimedia.org/T223463) (owner: 10Jbond) [12:29:57] (03CR) 10jerkins-bot: [V: 04-1] admin: new group add secteam-admin [puppet] - 10https://gerrit.wikimedia.org/r/521484 (https://phabricator.wikimedia.org/T223463) (owner: 10Jbond) [12:36:49] (03PS2) 10Jbond: admin: new user group secteam-users [puppet] - 10https://gerrit.wikimedia.org/r/521483 (https://phabricator.wikimedia.org/T223463) [12:39:40] (03PS1) 10Muehlenhoff: Add DHCP entries for orespoolcounter[12]00[34] [puppet] - 10https://gerrit.wikimedia.org/r/521485 (https://phabricator.wikimedia.org/T227567) [12:40:39] (03CR) 10Jbond: [C: 03+2] admin: new user group secteam-users [puppet] - 10https://gerrit.wikimedia.org/r/521483 (https://phabricator.wikimedia.org/T223463) (owner: 10Jbond) [12:40:50] (03PS3) 10Jbond: admin: new user group secteam-users [puppet] - 10https://gerrit.wikimedia.org/r/521483 (https://phabricator.wikimedia.org/T223463) [12:47:36] (03PS2) 10Muehlenhoff: Add DHCP entries for orespoolcounter[12]00[34] [puppet] - 10https://gerrit.wikimedia.org/r/521485 (https://phabricator.wikimedia.org/T227567) [12:48:03] 10Operations, 10SRE-Access-Requests, 10Security-Team, 10Patch-For-Review: Create secteam groups in admin.yaml and define permissions - https://phabricator.wikimedia.org/T223463 (10jbond) @chasemp I have gone ahead and created the `secteam-users` group (renamed from secteam to match our convention) so you... 
[12:54:28] (03CR) 10Muehlenhoff: [C: 03+2] Add DHCP entries for orespoolcounter[12]00[34] [puppet] - 10https://gerrit.wikimedia.org/r/521485 (https://phabricator.wikimedia.org/T227567) (owner: 10Muehlenhoff) [12:56:04] (03PS2) 10Ppchelko: Remove references to pdfrender from RESTBase. [puppet] - 10https://gerrit.wikimedia.org/r/519526 (https://phabricator.wikimedia.org/T226675) [13:04:10] (03PS1) 10Ema: vcl: stop calling WP Zero subroutines [puppet] - 10https://gerrit.wikimedia.org/r/521488 (https://phabricator.wikimedia.org/T213769) [13:05:28] (03PS5) 10Ema: Revert "Block WP Zero users from accessing Phabricator uploads" [puppet] - 10https://gerrit.wikimedia.org/r/479399 (https://phabricator.wikimedia.org/T213769) (owner: 10MaxSem) [13:13:56] PROBLEM - puppet last run on dns4002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:19:52] (03PS4) 10Andrew Bogott: bootstrap-vz: configure base image to use sssd for buster [puppet] - 10https://gerrit.wikimedia.org/r/521278 (https://phabricator.wikimedia.org/T227475) [13:21:11] (03CR) 10Andrew Bogott: [C: 03+2] bootstrap-vz: configure base image to use sssd for buster [puppet] - 10https://gerrit.wikimedia.org/r/521278 (https://phabricator.wikimedia.org/T227475) (owner: 10Andrew Bogott) [13:22:47] (03PS2) 10Andrew Bogott: cloud: sssd: use by default for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/521443 (https://phabricator.wikimedia.org/T227475) (owner: 10Arturo Borrero Gonzalez) [13:24:08] (03CR) 10Andrew Bogott: [C: 03+2] cloud: sssd: use by default for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/521443 (https://phabricator.wikimedia.org/T227475) (owner: 10Arturo Borrero Gonzalez) [13:24:44] (03CR) 10Elukey: [C: 03+2] Enable base::firewall on stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/521479 (https://phabricator.wikimedia.org/T170826) (owner: 10Elukey) [13:24:54] (03PS3) 10Elukey: Enable 
base::firewall on stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/521479 (https://phabricator.wikimedia.org/T170826) [13:25:07] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [13:25:09] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:48] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [13:25:49] (03CR) 10BBlack: [C: 03+1] Revert "Block WP Zero users from accessing Phabricator uploads" [puppet] - 10https://gerrit.wikimedia.org/r/479399 (https://phabricator.wikimedia.org/T213769) (owner: 10MaxSem) [13:25:49] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:34] !log enable base::firewall on stat1007 [13:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:38] (03PS6) 10Ema: Revert "Block WP Zero users from accessing Phabricator uploads" [puppet] - 10https://gerrit.wikimedia.org/r/479399 (https://phabricator.wikimedia.org/T213769) (owner: 10MaxSem) [13:27:21] 10Operations, 10Traffic, 10Zero, 10Patch-For-Review: Zero VCL removal - https://phabricator.wikimedia.org/T213769 (10ema) 05Stalled→03Open [13:27:23] (03CR) 10Ema: [C: 03+2] Revert "Block WP Zero users from accessing Phabricator uploads" [puppet] - 10https://gerrit.wikimedia.org/r/479399 (https://phabricator.wikimedia.org/T213769) (owner: 10MaxSem) [13:38:01] 10Operations, 10SRE-Access-Requests, 10Security-Team, 10Patch-For-Review: Create secteam groups in admin.yaml and define permissions - https://phabricator.wikimedia.org/T223463 (10sbassett) @jbond - @chasemp is on sabbatical until September, so it'll probably be a little while before this can be 
tested and... [13:38:41] (03PS2) 10Ema: vcl: stop calling WP Zero subroutines, remove vcl file [puppet] - 10https://gerrit.wikimedia.org/r/521488 (https://phabricator.wikimedia.org/T213769) [13:41:10] RECOVERY - puppet last run on dns4002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:42:26] 10Operations, 10Discovery, 10Traffic, 10WMDE-Analytics-Engineering, and 3 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 (10Ottomata) @Addshore, just saw T218710 and clicked through to here. If you use https://wikitech.wikimedia.org/wiki/HTTP_proxy, y... [13:48:32] (03PS1) 10Elukey: role::swap: enable base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/521494 (https://phabricator.wikimedia.org/T170826) [13:51:01] (03PS1) 10Jbond: wmcs - ldap: ensure buster uses sudo not sudoldap [puppet] - 10https://gerrit.wikimedia.org/r/521495 [13:51:15] (03PS7) 10Fsero: helmfile,k8s: creating hfenv variables [puppet] - 10https://gerrit.wikimedia.org/r/521441 (https://phabricator.wikimedia.org/T212130) [13:52:47] (03PS1) 10Andrew Bogott: cloud: sssd: more checks to ensure we use sssd on Buster [puppet] - 10https://gerrit.wikimedia.org/r/521496 (https://phabricator.wikimedia.org/T227475) [13:52:49] (03CR) 10BBlack: vcl: stop calling WP Zero subroutines, remove vcl file (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/521488 (https://phabricator.wikimedia.org/T213769) (owner: 10Ema) [13:53:42] (03CR) 10jerkins-bot: [V: 04-1] cloud: sssd: more checks to ensure we use sssd on Buster [puppet] - 10https://gerrit.wikimedia.org/r/521496 (https://phabricator.wikimedia.org/T227475) (owner: 10Andrew Bogott) [13:54:13] (03PS1) 10Urbanecm: Disable local uploads on wuuwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521497 (https://phabricator.wikimedia.org/T226764) [13:54:50] cmjohnson1: we're ready to go on T222960 whenever you are, the machine can be taken down at any time [13:54:50] 
T222960: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 [13:55:30] (03PS2) 10Jbond: wmcs - ldap: ensure buster uses sudo not sudoldap [puppet] - 10https://gerrit.wikimedia.org/r/521495 [13:55:53] (03CR) 10Fsero: [C: 03+2] helmfile,k8s: creating hfenv variables [puppet] - 10https://gerrit.wikimedia.org/r/521441 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [13:56:16] (03PS2) 10Andrew Bogott: cloud: sssd: more checks to ensure we use sssd on Buster [puppet] - 10https://gerrit.wikimedia.org/r/521496 (https://phabricator.wikimedia.org/T227475) [13:56:43] (03CR) 10Fsero: [C: 03+2] "@ako" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/521441 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [13:57:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. The Jupyterhub proxy should only be accessed from localhost, the NRPE/rsync/SSH/Exim/Prometheus ports will be handled by the c" [puppet] - 10https://gerrit.wikimedia.org/r/521494 (https://phabricator.wikimedia.org/T170826) (owner: 10Elukey) [13:58:43] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/521441 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [13:59:07] (03PS8) 10Fsero: helmfile,k8s: creating hfenv variables [puppet] - 10https://gerrit.wikimedia.org/r/521441 (https://phabricator.wikimedia.org/T212130) [13:59:35] !log installing orespoolcounter200[34] T227567 [13:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:40] T227567: Site: eqiad/codfw VM for ORES pool counters - https://phabricator.wikimedia.org/T227567 [14:00:28] (03PS1) 10Muehlenhoff: Add orespoolcounter[12]00[34] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/521498 [14:00:29] (03CR) 10Andrew Bogott: [C: 03+1] wmcs - ldap: ensure buster uses sudo not sudoldap [puppet] - 10https://gerrit.wikimedia.org/r/521495 (owner: 10Jbond) [14:01:28] (03CR) 10Jbond: [C: 03+2] wmcs - ldap: ensure buster uses sudo not sudoldap [puppet] - 10https://gerrit.wikimedia.org/r/521495 (owner: 10Jbond) [14:01:47] (03PS3) 10Jbond: wmcs - ldap: ensure buster uses sudo not sudoldap [puppet] - 10https://gerrit.wikimedia.org/r/521495 [14:03:02] urandom okay! I will ping you once it’s ready for reinstall [14:05:00] (03PS2) 10Elukey: role::swap: enable base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/521494 (https://phabricator.wikimedia.org/T170826) [14:05:59] (03PS3) 10Andrew Bogott: cloud: sssd: more checks to ensure we use sssd on Buster [puppet] - 10https://gerrit.wikimedia.org/r/521496 (https://phabricator.wikimedia.org/T227475) [14:06:08] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. 
Failed resources (up to 3 shown): File[/srv/deployment-charts/helmfile.d/services/staging/graphoid/.hfenv],File[/srv/deployment-charts/helmfile.d/services/eqiad/graphoid/.hfenv],File[/srv/deployment-charts/helmfile.d/services/codfw/graphoid/.hfenv] [14:07:07] (03CR) 10Andrew Bogott: [C: 03+2] cloud: sssd: more checks to ensure we use sssd on Buster [puppet] - 10https://gerrit.wikimedia.org/r/521496 (https://phabricator.wikimedia.org/T227475) (owner: 10Andrew Bogott) [14:07:46] (03PS1) 10Muehlenhoff: Switch all orespoolcounter VMs to virtual.cfg [puppet] - 10https://gerrit.wikimedia.org/r/521503 (https://phabricator.wikimedia.org/T227567) [14:07:52] (03PS1) 10Ppchelko: Undeploy pdfrender service. LVS/confd cleanup. [puppet] - 10https://gerrit.wikimedia.org/r/521504 (https://phabricator.wikimedia.org/T226675) [14:07:54] PROBLEM - puppet last run on deploy1001 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 2 minutes ago with 5 failures. Failed resources (up to 3 shown): File[/srv/deployment-charts/helmfile.d/admin/eqiad/.hfenv],File[/srv/deployment-charts/helmfile.d/admin/codfw/.hfenv],File[/srv/deployment-charts/helmfile.d/services/staging/graphoid/.hfenv],File[/srv/deployment-charts/helmfile.d/services/eqiad/graphoid/.hfenv] [14:08:09] (03PS3) 10Ema: vcl: remove WP Zero code [puppet] - 10https://gerrit.wikimedia.org/r/521488 (https://phabricator.wikimedia.org/T213769) [14:08:49] (03PS2) 10Ppchelko: Undeploy pdfrender service. LVS/confd cleanup. 
[puppet] - 10https://gerrit.wikimedia.org/r/521504 (https://phabricator.wikimedia.org/T226675) [14:09:07] (03CR) 10Ema: vcl: remove WP Zero code (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/521488 (https://phabricator.wikimedia.org/T213769) (owner: 10Ema) [14:09:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] Switch all orespoolcounter VMs to virtual.cfg [puppet] - 10https://gerrit.wikimedia.org/r/521503 (https://phabricator.wikimedia.org/T227567) (owner: 10Muehlenhoff) [14:09:25] urandom rb1017 is going to private vlan correct? [14:09:48] i see all the other restbases in private [14:09:54] cmjohnson1: no idea; I don't know what that means in this context [14:10:13] is this going to be setup the same as restbase1024 [14:10:21] yes [14:10:25] okay...cool! thanks [14:10:49] cmjohnson1: I guess it gets a new IP as a result, but these machines also have 3 additional IPs [14:10:56] cmjohnson1: for a total of 4 [14:11:19] (03PS2) 10Muehlenhoff: Switch all orespoolcounter VMs to virtual.cfg [puppet] - 10https://gerrit.wikimedia.org/r/521503 (https://phabricator.wikimedia.org/T227567) [14:11:52] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:12:10] oh! 
well that may not be me...i can get it so you can re-install everything but may have to defer to someone else for the additional ip addresses [14:12:38] (03CR) 10Muehlenhoff: [C: 03+2] Switch all orespoolcounter VMs to virtual.cfg [puppet] - 10https://gerrit.wikimedia.org/r/521503 (https://phabricator.wikimedia.org/T227567) (owner: 10Muehlenhoff) [14:13:59] (03PS1) 10Fsero: helmfile,k8s: bug: we should require the directory if not fails [puppet] - 10https://gerrit.wikimedia.org/r/521505 (https://phabricator.wikimedia.org/T212130) [14:14:19] (03PS1) 10Andrew Bogott: labs_bootstrapvz: ignore /etc/nslcd.conf on buster [puppet] - 10https://gerrit.wikimedia.org/r/521506 (https://phabricator.wikimedia.org/T227475) [14:15:15] (03PS2) 10Andrew Bogott: labs_bootstrapvz: ignore /etc/nslcd.conf on buster [puppet] - 10https://gerrit.wikimedia.org/r/521506 (https://phabricator.wikimedia.org/T227475) [14:15:24] (03PS1) 10Jbond: wmcs: fix type in variable names [puppet] - 10https://gerrit.wikimedia.org/r/521508 [14:15:49] (03CR) 10Andrew Bogott: [C: 03+2] labs_bootstrapvz: ignore /etc/nslcd.conf on buster [puppet] - 10https://gerrit.wikimedia.org/r/521506 (https://phabricator.wikimedia.org/T227475) (owner: 10Andrew Bogott) [14:16:32] (03CR) 10Jbond: [C: 03+2] wmcs: fix type in variable names [puppet] - 10https://gerrit.wikimedia.org/r/521508 (owner: 10Jbond) [14:16:40] (03PS2) 10Jbond: wmcs: fix type in variable names [puppet] - 10https://gerrit.wikimedia.org/r/521508 [14:18:50] 10Operations, 10Analytics, 10Patch-For-Review, 10User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (10elukey) Added documentation in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU [14:21:14] (03PS1) 10Ema: varnish: remove WP Zero puppetization [puppet] - 10https://gerrit.wikimedia.org/r/521510 (https://phabricator.wikimedia.org/T213769) [14:21:39] !log tarrow@deploy1001 scap-helm termbox upgrade staging stable/termbox -f 
termbox-staging-values.yaml [namespace: termbox, clusters: staging] [14:21:40] !log tarrow@deploy1001 scap-helm termbox cluster staging completed [14:21:40] !log tarrow@deploy1001 scap-helm termbox finished [14:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:53] (03CR) 10Fsero: "PCC happy https://puppet-compiler.wmflabs.org/compiler1001/17273/deploy1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/521505 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [14:22:04] (03PS2) 10Fsero: helmfile,k8s: bug: we should require the directory if not fails [puppet] - 10https://gerrit.wikimedia.org/r/521505 (https://phabricator.wikimedia.org/T212130) [14:24:46] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10Pchelolo) [14:25:08] (03PS4) 10Ema: vcl: remove WP Zero code [puppet] - 10https://gerrit.wikimedia.org/r/521488 (https://phabricator.wikimedia.org/T213769) [14:25:10] (03PS1) 10Ema: vcl: do not set WP Zero X-Carrier headers [puppet] - 10https://gerrit.wikimedia.org/r/521511 (https://phabricator.wikimedia.org/T213769) [14:25:26] (03CR) 10Fsero: [C: 03+2] helmfile,k8s: bug: we should require the directory if not fails [puppet] - 10https://gerrit.wikimedia.org/r/521505 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [14:25:29] cmjohnson1: so i can reinstall everything? 
[14:25:56] cmjohnson1: asking because there isn't much of this that I can do [14:26:47] cmjohnson1: once up with the right puppet role, I can kick off the Cassandra bootstraps, but that's pretty much it [14:26:55] (03PS1) 10Fsero: helmfile,k8s: bug: we should require the directory if not fails [puppet] - 10https://gerrit.wikimedia.org/r/521512 (https://phabricator.wikimedia.org/T212130) [14:27:12] (03PS3) 10Ottomata: eventstreams: add admins contact to eventstreams check [puppet] - 10https://gerrit.wikimedia.org/r/520475 (https://phabricator.wikimedia.org/T227065) (owner: 10Herron) [14:27:17] (03CR) 10jerkins-bot: [V: 04-1] helmfile,k8s: bug: we should require the directory if not fails [puppet] - 10https://gerrit.wikimedia.org/r/521512 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [14:28:36] (03PS2) 10Fsero: helmfile,k8s: bug: we should require the directory if not fails [puppet] - 10https://gerrit.wikimedia.org/r/521512 (https://phabricator.wikimedia.org/T212130) [14:28:38] !log rebooting cloudnet1004.eqiad T224228 [14:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:44] (03CR) 10Ottomata: [C: 03+2] eventstreams: add admins contact to eventstreams check [puppet] - 10https://gerrit.wikimedia.org/r/520475 (https://phabricator.wikimedia.org/T227065) (owner: 10Herron) [14:28:45] !log jiji@deploy1001 Started deploy [cpjobqueue/deploy@8517fec]: Migrating cirrus* jobs to PHP7 - T219150 [14:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:59] T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 [14:29:38] (03PS1) 10Andrew Bogott: labs_bootstrapvz: fix path to sssd.conf [puppet] - 10https://gerrit.wikimedia.org/r/521513 [14:29:48] !log jiji@deploy1001 Finished deploy [cpjobqueue/deploy@8517fec]: Migrating cirrus* jobs to PHP7 - T219150 (duration: 01m 02s) [14:29:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:31] PROBLEM - Host cloudnet1004 is DOWN: PING CRITICAL - Packet loss = 100% [14:31:38] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/srv/deployment-charts/helmfile.d/services/staging/graphoid/.hfenv],File[/srv/deployment-charts/helmfile.d/services/eqiad/graphoid/.hfenv],File[/srv/deployment-charts/helmfile.d/services/codfw/graphoid/.hfenv] [14:32:05] expected as per SAL (cloudnet1004) [14:32:16] it paged however [14:32:29] it did page indeed [14:32:37] RECOVERY - Host cloudnet1004 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [14:33:57] (03CR) 10Fsero: [C: 03+2] helmfile,k8s: bug: we should require the directory if not fails [puppet] - 10https://gerrit.wikimedia.org/r/521512 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [14:34:11] (03PS3) 10Fsero: helmfile,k8s: bug: we should require the directory if not fails [puppet] - 10https://gerrit.wikimedia.org/r/521512 (https://phabricator.wikimedia.org/T212130) [14:34:52] sorry for the page... [14:36:03] 10Operations, 10Puppet, 10Packaging: upgrade puppet master servers - https://phabricator.wikimedia.org/T227587 (10jbond) [14:38:24] (03PS1) 10Jbond: puppet: refactor remove puppetdb_major_version [puppet] - 10https://gerrit.wikimedia.org/r/521514 [14:38:30] np jeh ! [14:39:04] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:39:09] (03CR) 10jerkins-bot: [V: 04-1] puppet: refactor remove puppetdb_major_version [puppet] - 10https://gerrit.wikimedia.org/r/521514 (owner: 10Jbond) [14:40:18] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/521514 (owner: 10Jbond) [14:40:24] RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:40:32] (03CR) 10jerkins-bot: [V: 04-1] puppet: refactor remove puppetdb_major_version [puppet] - 10https://gerrit.wikimedia.org/r/521514 (owner: 10Jbond) [14:42:05] PROBLEM - Host restbase1017 is DOWN: PING CRITICAL - Packet loss = 100% [14:42:23] (03CR) 10SBassett: [C: 03+1] "Would be nice to get this deployed and tested!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521233 (https://phabricator.wikimedia.org/T181217) (owner: 10Reedy) [14:42:54] (03PS2) 10Ottomata: Refine mediawiki_page* with schema aware Refine job [puppet] - 10https://gerrit.wikimedia.org/r/521328 (https://phabricator.wikimedia.org/T211248) [14:42:55] !log reject RPKI invalids on ulsfo peering link - T220669 [14:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:00] PROBLEM - Host restbase1017.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:43:01] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669 [14:43:34] PROBLEM - Host ms-be1018 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:56] (03CR) 10Ottomata: [C: 03+2] Refine mediawiki_page* with schema aware Refine job [puppet] - 10https://gerrit.wikimedia.org/r/521328 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [14:44:38] (03CR) 10Elukey: "Moritz, I think it is better to split the change in two, namely:" [puppet] - 10https://gerrit.wikimedia.org/r/521494 (https://phabricator.wikimedia.org/T170826) (owner: 10Elukey) [14:44:42] (03PS2) 10Jbond: puppet: 
refactor remove puppetdb_major_version [puppet] - 10https://gerrit.wikimedia.org/r/521514 (https://phabricator.wikimedia.org/T227587) [14:45:27] (03CR) 10jerkins-bot: [V: 04-1] puppet: refactor remove puppetdb_major_version [puppet] - 10https://gerrit.wikimedia.org/r/521514 (https://phabricator.wikimedia.org/T227587) (owner: 10Jbond) [14:46:27] (03CR) 10Muehlenhoff: [C: 03+1] "Existing established connections should not be affected, when applying Ferm your SSH connection e.g. remains, but sure, makes sense to app" [puppet] - 10https://gerrit.wikimedia.org/r/521494 (https://phabricator.wikimedia.org/T170826) (owner: 10Elukey) [14:47:39] 10Operations, 10netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10ayounsi) Confirmed with a given test peer that was sending us RPKI invalids and unknown. We now only receive the unknown. And for a given invalid prefix that we used to receive via peering is now going through transit. [14:47:47] I'm looking into ms-be1018 [14:48:00] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar), 10User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10elukey) Makes sense, I am now wondering if we should create a generic and configurable alarm or not :) [14:48:37] mhhh server power is off [14:48:48] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [14:49:03] cmjohnson1: any chance ms-be1018 powered off could be related to restbase1017's move ? 
[14:49:14] (03Abandoned) 10Elukey: role::swap: enable base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/521494 (https://phabricator.wikimedia.org/T170826) (owner: 10Elukey) [14:49:30] RECOVERY - Host restbase1017.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [14:50:11] !log installing orespoolcounter100[34] T227567 [14:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:19] T227567: Site: eqiad/codfw VM for ORES pool counters - https://phabricator.wikimedia.org/T227567 [14:50:28] godog: i am in that rack but it's not related [14:50:39] the power still looks good on my end but I do see that's it is off [14:50:58] ok thanks, I'll power it back on [14:51:08] i just did [14:51:12] (03PS1) 10Elukey: role::swap: restrict Spark driver port range [puppet] - 10https://gerrit.wikimedia.org/r/521516 (https://phabricator.wikimedia.org/T170826) [14:51:28] (03PS3) 10Jbond: puppet: refactor remove puppetdb_major_version [puppet] - 10https://gerrit.wikimedia.org/r/521514 (https://phabricator.wikimedia.org/T227587) [14:51:41] ok [14:51:42] (03PS4) 10Jbond: puppet: refactor remove puppetdb_major_version [puppet] - 10https://gerrit.wikimedia.org/r/521514 (https://phabricator.wikimedia.org/T227587) [14:51:48] (03CR) 10Elukey: [C: 03+2] role::swap: restrict Spark driver port range [puppet] - 10https://gerrit.wikimedia.org/r/521516 (https://phabricator.wikimedia.org/T170826) (owner: 10Elukey) [14:51:50] godog: i may have accidentally hit the power button...i racked rb1017 just above it and it's a tricky rack to get things in an out of because of a pole directly in front of it [14:51:56] im sorry [14:52:08] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/521514 (https://phabricator.wikimedia.org/T227587) (owner: 10Jbond) [14:52:34] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:53:10] !log repooled elastic2054 - T227298 
[14:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:15] T227298: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 [14:53:29] cmjohnson1: ok np, I'll check everything is fine on the host [14:54:12] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:54:16] RECOVERY - Host ms-be1018 is UP: PING OK - Packet loss = 0%, RTA = 2.22 ms [14:54:53] (03PS1) 10Ottomata: Migrate mediawiki.recentchange stream to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521517 (https://phabricator.wikimedia.org/T211248) [14:55:04] (03PS2) 10Ottomata: Migrate mediawiki.recentchange stream to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521517 (https://phabricator.wikimedia.org/T211248) [14:55:07] (03CR) 10jerkins-bot: [V: 04-1] Migrate mediawiki.recentchange stream to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521517 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [14:56:05] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki) [14:56:44] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/521514 (https://phabricator.wikimedia.org/T227587) (owner: 10Jbond) [14:58:38] (03CR) 10Effie Mouzeli: [C: 03+1] uwsgi::app: add notes_url for services using uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/521376 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [14:59:30] PROBLEM - confd service on bast3002 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:59:52] I am having a quick look at this [14:59:53] (03PS4) 10Urbanecm: Fix several incorrect logo sizes 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/521316 (https://phabricator.wikimedia.org/T211413) [14:59:58] (03CR) 10Andrew Bogott: [C: 03+2] labs_bootstrapvz: fix path to sssd.conf [puppet] - 10https://gerrit.wikimedia.org/r/521513 (owner: 10Andrew Bogott) [15:00:39] (03PS2) 10Andrew Bogott: labs_bootstrapvz: fix path to sssd.conf [puppet] - 10https://gerrit.wikimedia.org/r/521513 [15:00:54] (03PS1) 10Andrew Bogott: cloud-vps hiera: move more ldap lookups to the ro replicas [puppet] - 10https://gerrit.wikimedia.org/r/521518 [15:02:03] (03PS1) 10Cmjohnson: Moving production dns entries for restbase1017 [dns] - 10https://gerrit.wikimedia.org/r/521519 (https://phabricator.wikimedia.org/T222960) [15:03:04] urandom ^ IP changes for restbase1017. [15:03:24] !log rebooting cloudnet1003.eqiad T224228 [15:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:52] 10Operations, 10ops-eqiad, 10Cassandra, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 4 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Cmjohnson) restbase1017 has been moved to rack B5 network port updated DNS updated [15:09:14] 10Operations, 10Discovery, 10Traffic, 10WMDE-Analytics-Engineering, and 3 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 (10Ottomata) Ah, hm ok. Actually, @elukey why can't we allow the VIP IP? We did this in {T221690}, no? [15:10:16] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:11:41] 10Operations, 10SRE-Access-Requests, 10Security-Team, 10Patch-For-Review: Create secteam groups in admin.yaml and define permissions - https://phabricator.wikimedia.org/T223463 (10jbond) @sbassett thanks for the info [15:13:12] !log reject RPKI invalids on Dallas peering link - T220669 [15:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:17] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669 [15:14:32] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/521514 (https://phabricator.wikimedia.org/T227587) (owner: 10Jbond) [15:14:42] (03PS6) 10Urbanecm: Fix several incorrect logo sizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521316 (https://phabricator.wikimedia.org/T211413) [15:15:08] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:15:38] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:15:57] (03PS14) 10Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [15:16:10] RECOVERY - confd service on bast3002 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:17:03] (03CR) 10jerkins-bot: [V: 04-1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: 10Urbanecm) [15:19:45] (03CR) 10Daniel Kinzler: [C: 03+1] Specify CentralAuth session storage separately from per-wiki session storage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521409 (https://phabricator.wikimedia.org/T227097) (owner: 10BPirkle) [15:20:45] !log reject RPKI invalids on Singapore peering link - T220669 [15:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:51] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669 [15:21:58] (03CR) 10Daniel Kinzler: [C: 03+1] "Is it possible that the OAuth extension has the same issue? I don't know how that works exactly, I just noticed that it also uses wgSessio" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521409 (https://phabricator.wikimedia.org/T227097) (owner: 10BPirkle) [15:22:41] !log reboot ms-be2023 with oemhp_powerreg=os - T225713 [15:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:47] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 [15:24:18] (03PS3) 10Alexandros Kosiaris: Undeploy pdfrender service. LVS/confd cleanup. 
[puppet] - 10https://gerrit.wikimedia.org/r/521504 (https://phabricator.wikimedia.org/T226675) (owner: 10Ppchelko) [15:27:20] (03CR) 10Alexandros Kosiaris: [C: 03+2] "A bit aggressive (could be split up a bit more) but since we are killing a service the worst that can happen is that we revert and put it " [puppet] - 10https://gerrit.wikimedia.org/r/521504 (https://phabricator.wikimedia.org/T226675) (owner: 10Ppchelko) [15:27:37] (03PS2) 10Muehlenhoff: Add orespoolcounter[12]00[34] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/521498 [15:27:47] !log reboot ms-be2024 with oemhp_powerreg=os - T225713 [15:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:52] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 [15:28:22] !log reject RPKI invalids on Chicago peering link - T220669 [15:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:28] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669 [15:28:46] andrewbogott: I am merging 90203075d9 as well on the puppetmasters [15:29:02] labs_bootstrapvz: fix path to sssd.conf (90203075d9) [15:29:09] ok, thanks [15:30:25] akosiaris Pchelolo feelsgood.png (re: pdfrender) \o/ [15:31:06] 10Operations, 10Discovery, 10Traffic, 10WMDE-Analytics-Engineering, and 3 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 (10elukey) Not really, I wish myself from the past added more info. I asked to @ayounsi and he didn't come up with a reason not to,... [15:31:10] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:31:16] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:33:05] !log restart pybal on lvs2006, lvs1016. 
Removal of pdfrender service T226675 [15:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:09] T226675: Undeploy electron service from WMF production - https://phabricator.wikimedia.org/T226675 [15:33:50] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - /mnt/docker/overlay2/843610a04cd773214d6795a2696dcc669b539c8092219d1b436423a82698249e/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [15:36:04] PROBLEM - puppet last run on restbase1024 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [15:36:08] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [15:36:16] (03PS2) 10Andrew Bogott: cloud-vps hiera: move more ldap lookups to the ro replicas [puppet] - 10https://gerrit.wikimedia.org/r/521518 (https://phabricator.wikimedia.org/T46722) [15:36:51] (03CR) 10Jbond: "This is good to be reviewed, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/521514 (https://phabricator.wikimedia.org/T227587) (owner: 10Jbond) [15:36:56] akosiaris: I think you forgot https://gerrit.wikimedia.org/r/c/operations/puppet/+/519526 which the other one depended on [15:37:42] (03PS3) 10Alexandros Kosiaris: Remove references to pdfrender from RESTBase. [puppet] - 10https://gerrit.wikimedia.org/r/519526 (https://phabricator.wikimedia.org/T226675) (owner: 10Ppchelko) [15:37:44] indeed. Merging [15:37:49] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Remove references to pdfrender from RESTBase. [puppet] - 10https://gerrit.wikimedia.org/r/519526 (https://phabricator.wikimedia.org/T226675) (owner: 10Ppchelko) [15:37:58] PROBLEM - puppet last run on restbase1018 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. 
Could be an interrupted request or a dependency cycle. [15:38:02] !log reject RPKI invalids on Amsterdam peering link - T220669 [15:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:07] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669 [15:38:29] (03CR) 10Cwhite: [C: 03+1] "LGTM (untested)" [puppet] - 10https://gerrit.wikimedia.org/r/521295 (owner: 10Jbond) [15:38:38] PROBLEM - puppet last run on restbase1019 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [15:38:47] urandom: everything has been moved and updated but I cannot get the server to pxe [15:38:53] !log restart pybal on lvs2003, lvs1015. Removal of pdfrender service T226675 [15:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:58] T226675: Undeploy electron service from WMF production - https://phabricator.wikimedia.org/T226675 [15:39:16] (03PS3) 10Andrew Bogott: cloud-vps hiera: move all cloud VMs to the read-only ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/521518 (https://phabricator.wikimedia.org/T46722) [15:39:58] PROBLEM - puppet last run on restbase2020 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [15:40:48] if restbase1017 gets reimaged, best to switch it to stretch? per https://phabricator.wikimedia.org/T224553 it was previoiusly running jessie [15:40:56] (03PS4) 10Andrew Bogott: Move all cloud VMs to the read-only ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/521518 (https://phabricator.wikimedia.org/T46722) [15:41:13] cmjohnson1, moritzm: oh, yeah, good point [15:41:17] 10Operations, 10netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10MelchiorAelmans) Good job team!! [15:41:18] re: stretch [15:41:29] buster !!! 
:P [15:41:59] akosiaris: you have moxie my friend [15:42:30] PROBLEM - puppet last run on restbase2015 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [15:42:32] PROBLEM - puppet last run on authdns1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[gdnsd] [15:42:33] au contraire [15:43:02] RECOVERY - puppet last run on restbase1018 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:43:44] RECOVERY - puppet last run on restbase1019 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:43:50] PROBLEM - puppet last run on restbase1026 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [15:44:16] !log reject RPKI invalids on Ashburn peering links - T220669 [15:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:21] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669 [15:44:24] (03PS7) 10Urbanecm: Fix several incorrect logo sizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521316 (https://phabricator.wikimedia.org/T211413) [15:45:06] RECOVERY - puppet last run on restbase2020 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:45:11] (03PS15) 10Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [15:45:16] (03CR) 10Cwhite: "Hashar, thanks for this! It's on my list to deal with the lintian error. 
Not sure if this should come before the lintian fix but I'm goo" [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/519363 (owner: 10Hashar) [15:46:25] (03CR) 10jerkins-bot: [V: 04-1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: 10Urbanecm) [15:46:32] (03PS1) 10Alexandros Kosiaris: pdfrender: Remove discovery records [dns] - 10https://gerrit.wikimedia.org/r/521523 (https://phabricator.wikimedia.org/T226675) [15:47:02] (03CR) 10Alexandros Kosiaris: [C: 03+2] pdfrender: Remove discovery records [dns] - 10https://gerrit.wikimedia.org/r/521523 (https://phabricator.wikimedia.org/T226675) (owner: 10Alexandros Kosiaris) [15:47:08] PROBLEM - puppet last run on restbase1020 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [15:47:48] RECOVERY - puppet last run on authdns1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:47:50] RECOVERY - puppet last run on restbase2015 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:48:52] RECOVERY - puppet last run on restbase1026 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:49:16] 10Operations, 10netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10ayounsi) Thanks! We now reject RPKI invalids on all our private/public peering sessions. Next step is to review/merge https://gerrit.wikimedia.org/r/c/520337/ so we have almost real-time visibility on which % of our traffi... 
[15:51:00] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Deployment services): Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (10Jakob_WMDE) @MoritzMuehlenhoff here is the public key: `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAENSef8ugACShtUryIYbII3C0bwJ8D... [15:51:18] RECOVERY - puppet last run on restbase1024 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:51:22] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:51:33] (03PS1) 10Eevans: restbase: update rb1017 Cassandra instances for rack move [puppet] - 10https://gerrit.wikimedia.org/r/521525 (https://phabricator.wikimedia.org/T222960) [15:51:55] (03PS8) 10Urbanecm: Fix several incorrect logo sizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521316 (https://phabricator.wikimedia.org/T211413) [15:52:08] RECOVERY - puppet last run on restbase1020 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:52:08] (03PS16) 10Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [15:53:06] (03CR) 10jerkins-bot: [V: 04-1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: 10Urbanecm) [15:53:33] cmjohnson1: ^^ the above seems like it must be missing something [15:53:55] if so, hopefully it's at least a starting point [15:56:30] (03PS1) 10Urbanecm: Remove fawikiquote HD logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521527 [15:56:30] urandom, discussing with Robh and he believes that I tried to install prior to the dns change being completed. I am going to give it 1 hour to see if it clears and try to re-install. Everything else appears to be correct. 
[15:56:50] (03PS1) 10Ottomata: Move wgRCFeeds settings from CommonSettings to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521528 (https://phabricator.wikimedia.org/T211248) [15:57:32] Ok [15:57:52] (03CR) 10jerkins-bot: [V: 04-1] Move wgRCFeeds settings from CommonSettings to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521528 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [15:57:58] (03PS2) 10Urbanecm: Remove fawikiquote HD logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521527 [15:58:03] (03PS9) 10Urbanecm: Fix several incorrect logo sizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521316 (https://phabricator.wikimedia.org/T211413) [15:58:21] (03PS17) 10Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [15:59:23] (03CR) 10jerkins-bot: [V: 04-1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: 10Urbanecm) [16:00:04] godog and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190709T1600). Please do the needful. [16:00:04] Amir1: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:55] (03PS2) 10Ottomata: Move wgRCFeeds settings from CommonSettings to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521528 (https://phabricator.wikimedia.org/T211248) [16:01:45] (03CR) 10jerkins-bot: [V: 04-1] Move wgRCFeeds settings from CommonSettings to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521528 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [16:02:56] wow jouncebot, rude [16:03:00] (03PS10) 10Urbanecm: Fix several incorrect logo sizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521316 (https://phabricator.wikimedia.org/T211413) [16:03:22] Amir1: here? I looked at your patch, although it seems T176875 might be moving? I'd really like to avoid hardcoding an host like that [16:03:22] T176875: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 [16:03:30] RECOVERY - Disk space on contint1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [16:05:39] godog: I like to avoid it too but it's hardcoded in the software [16:05:55] this way it's at least hard-coded in puppet, so it shows up in your greps [16:07:47] Amir1: ah! 
ok thanks for the context, please add this info to the commit message [16:09:40] godog: done [16:09:45] (03PS2) 10Ladsgroup: statistics: Add wdqs host to wmde statistcs configuration [puppet] - 10https://gerrit.wikimedia.org/r/520901 (https://phabricator.wikimedia.org/T218710) [16:10:57] thanks, that's helpful Amir1 [16:11:03] I'll merge it [16:11:22] (03PS3) 10Filippo Giunchedi: statistics: Add wdqs host to wmde statistcs configuration [puppet] - 10https://gerrit.wikimedia.org/r/520901 (https://phabricator.wikimedia.org/T218710) (owner: 10Ladsgroup) [16:11:42] (03CR) 10Filippo Giunchedi: [C: 03+2] statistics: Add wdqs host to wmde statistcs configuration [puppet] - 10https://gerrit.wikimedia.org/r/520901 (https://phabricator.wikimedia.org/T218710) (owner: 10Ladsgroup) [16:14:22] (03CR) 10Ottomata: "Ah sorry, Filippo sorry we are revisiting https://phabricator.wikimedia.org/T176875, I think this can be done with the svc url" [puppet] - 10https://gerrit.wikimedia.org/r/520901 (https://phabricator.wikimedia.org/T218710) (owner: 10Ladsgroup) [16:14:57] Thank you so much [16:15:04] (03PS18) 10Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [16:16:04] (03CR) 10jerkins-bot: [V: 04-1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: 10Urbanecm) [16:16:50] mhhh haven't actually puppet-merge'd yet [16:16:56] Amir1: see ottomata's comment [16:17:04] (03PS11) 10Urbanecm: Fix several incorrect logo sizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521316 (https://phabricator.wikimedia.org/T211413) [16:17:20] ok going ahead for now, we'll change it to the svc url [16:17:22] (03PS19) 10Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 
(https://phabricator.wikimedia.org/T211413) [16:18:12] (03CR) 10Filippo Giunchedi: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/520901 (https://phabricator.wikimedia.org/T218710) (owner: 10Ladsgroup) [16:18:26] (03CR) 10jerkins-bot: [V: 04-1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: 10Urbanecm) [16:20:08] (03PS12) 10Urbanecm: Fix several incorrect logo sizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521316 (https://phabricator.wikimedia.org/T211413) [16:20:20] (03PS20) 10Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [16:21:32] (03CR) 10jerkins-bot: [V: 04-1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: 10Urbanecm) [16:23:12] (03PS1) 10Jbond: puppet: update puppet-temini package name on buster [puppet] - 10https://gerrit.wikimedia.org/r/521536 (https://phabricator.wikimedia.org/T227587) [16:23:21] (03PS13) 10Urbanecm: Fix several incorrect logo sizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521316 (https://phabricator.wikimedia.org/T211413) [16:23:39] (03PS21) 10Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [16:24:04] 10Operations, 10Traffic, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-Addshore: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (10Addshore) [16:24:43] (03CR) 10jerkins-bot: [V: 04-1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 
(https://phabricator.wikimedia.org/T211413) (owner: 10Urbanecm) [16:25:28] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/521536 (https://phabricator.wikimedia.org/T227587) (owner: 10Jbond) [16:26:11] (03PS14) 10Urbanecm: Fix several incorrect logo sizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521316 (https://phabricator.wikimedia.org/T211413) [16:27:29] (03PS22) 10Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [16:28:28] (03CR) 10jerkins-bot: [V: 04-1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: 10Urbanecm) [16:29:27] (03PS15) 10Urbanecm: Fix several incorrect logo sizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521316 (https://phabricator.wikimedia.org/T211413) [16:29:28] !log reboot ms-be2025 with oemhp_powerreg=os - T225713 [16:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:33] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 [16:29:38] (03PS23) 10Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [16:30:55] (03CR) 10jerkins-bot: [V: 04-1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: 10Urbanecm) [16:38:46] (03PS24) 10Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [16:39:43] (03CR) 10jerkins-bot: [V: 04-1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: 10Urbanecm) [16:40:13] (03PS1) 10KartikMistry: Add Niklas to deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/521539 [16:41:36] (03PS16) 10Urbanecm: Fix several incorrect logo sizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521316 (https://phabricator.wikimedia.org/T211413) [16:42:02] (03PS25) 10Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [16:42:37] !log reboot ms-be2026 with oemhp_powerreg=os - T225713 [16:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:41] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 [16:42:59] (03CR) 10jerkins-bot: [V: 04-1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: 10Urbanecm) [16:46:53] (03CR) 10Krinkle: [C: 04-1] "For this to be safe, $wmgUseEventBus would have to be removed first." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521528 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [16:47:45] (03PS26) 10Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [16:48:39] (03CR) 10jerkins-bot: [V: 04-1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: 10Urbanecm) [16:49:51] (03PS1) 10Nuria: Disabling temporarily the 3rd party filter for EL events [puppet] - 10https://gerrit.wikimedia.org/r/521541 (https://phabricator.wikimedia.org/T227150) [16:50:25] (03PS27) 10Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [16:54:03] !log reboot ms-be2027 with oemhp_powerreg=os - T225713 [16:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:08] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 [16:56:25] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - /mnt/docker/overlay2/42082d324583d16691c4e0ddaf2572716f99740ea22f6ad23517419a3a3e3356/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [16:59:16] (03PS3) 10Ottomata: Use wgEventServiceStreamConfig to configure wgRCFeeds['eventbus'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521528 (https://phabricator.wikimedia.org/T211248) [16:59:19] RECOVERY - Disk space on contint1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [16:59:49] (03CR) 10Ottomata: "Ok, different idea. This way we can configure the recentchange stream destination the same way we configure the others." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/521528 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [16:59:58] !log reboot ms-be2039 with oemhp_powerreg=os - T225713 [17:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] cscott, arlolra, subbu, and halfak: (Dis)respected human, time to deploy Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190709T1700). Please do the needful. [17:00:04] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 [17:05:21] 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, and 2 others: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10Ottomata) Eric needs the analytics-search user to be able to access the swift auth file so his Oozie jobs can upload to swift. analytics-search... [17:07:50] 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, and 2 others: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10Nuria) @Ottomata , +1 to that idea [17:08:25] urandom: so im finishing the install on https://phabricator.wikimedia.org/T222960 [17:08:34] once puppet is running i assume you want to take over? [17:08:57] turns out the installer system had the old dns cached for the old ip [17:09:19] so waited an hour, it expired on installer host (expiring the dns systems ahead of time doesnt seem to help once installer caches the ip info) [17:09:23] and now its installing [17:11:16] robh: yeah. 
I'm preoccupied atm, but ping me when it's done and I'll have look when I'm no longer afk [17:13:04] no worries [17:13:06] (03CR) 10Ppchelko: [C: 03+1] Use wgEventServiceStreamConfig to configure wgRCFeeds['eventbus'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521528 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [17:14:06] (03Abandoned) 10Ppchelko: Clean up configuration for pdfrender service. [puppet] - 10https://gerrit.wikimedia.org/r/514226 (https://phabricator.wikimedia.org/T226675) (owner: 10Ppchelko) [17:14:12] (03PS5) 10Andrew Bogott: Move all cloud VMs to the read-only ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/521518 (https://phabricator.wikimedia.org/T46722) [17:15:56] (03CR) 10Andrew Bogott: [C: 03+2] Move all cloud VMs to the read-only ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/521518 (https://phabricator.wikimedia.org/T46722) (owner: 10Andrew Bogott) [17:26:37] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@2a9d097]: Fix etag generation for the talk endpoint (T227481) [17:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:42] T227481: Talk endpoint returns wrong etag - https://phabricator.wikimedia.org/T227481 [17:30:26] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@2a9d097]: Fix etag generation for the talk endpoint (T227481) (duration: 03m 49s) [17:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:41] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@2a9d097]: Fix etag generation for the talk endpoint, take 2 [17:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:46] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@2a9d097]: Fix etag generation for the talk endpoint, take 2 (duration: 02m 04s) [17:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:17] (03PS3) 10Ayounsi: 
Add rpkicounter [puppet] - 10https://gerrit.wikimedia.org/r/520337 [17:39:09] (03PS1) 10Ppchelko: Undeploy pdfrender: remove remaining classes. [puppet] - 10https://gerrit.wikimedia.org/r/521550 (https://phabricator.wikimedia.org/T226675) [17:39:44] (03CR) 10Ayounsi: Add rpkicounter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520337 (owner: 10Ayounsi) [17:46:35] (03CR) 10Ppchelko: "Puppet compiler https://puppet-compiler.wmflabs.org/compiler1001/17278/" [puppet] - 10https://gerrit.wikimedia.org/r/521550 (https://phabricator.wikimedia.org/T226675) (owner: 10Ppchelko) [17:47:21] 10Operations, 10Release-Engineering-Team-TODO, 10observability, 10Performance-Team (Radar), and 2 others: Increase "check_legal_html" coverage to group0 wikis - https://phabricator.wikimedia.org/T208284 (10greg) [17:47:26] 10Operations, 10Keyholder, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Deployment services): Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003 (10greg) [17:48:43] 10Operations, 10Release-Engineering-Team-TODO, 10observability, 10Release-Engineering-Team (Deployment services): "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) 
- https://phabricator.wikimedia.org/T141520 (10greg) [17:48:52] 10Operations, 10Release-Engineering-Team-TODO, 10Core Platform Team Backlog (Later), 10Release-Engineering-Team (Deployment services), and 2 others: Review new service 'pre-deployment to production' checklist - https://phabricator.wikimedia.org/T141897 (10greg) [17:48:57] 10Operations, 10Contributors-Team, 10Release-Engineering-Team-TODO, 10observability, and 2 others: High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090 (10greg) [17:49:02] 10Operations, 10Release-Engineering-Team-TODO, 10observability, 10Release-Engineering-Team (Deployment services), and 2 others: Tracking: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942 (10greg) [17:51:25] 10Operations, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Deployment services), 10User-Joe: [DRAFT][RfC] Deployment of python applications in production - https://phabricator.wikimedia.org/T180023 (10greg) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190709T1800) [18:09:02] (03PS4) 10Dzahn: uwsgi::app: add notes_url for services using uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/521376 (https://phabricator.wikimedia.org/T197873) [18:12:07] (03CR) 10Dzahn: [C: 03+2] uwsgi::app: add notes_url for services using uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/521376 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [18:16:27] (03PS7) 10Ppchelko: RESTRouter: Add initial Helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [18:16:34] (03PS1) 10Krinkle: Change /w/skin-1.5 symlink to be relative instead of absolute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521562 (https://phabricator.wikimedia.org/T156319) [18:16:38] (03PS2) 10Krinkle: Change /w/skin-1.5 
symlink to be relative instead of absolute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521562 [18:19:24] !log cutting the branch for 1.34.0-wmf.13 T220738 [18:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:29] T220738: 1.34.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T220738 [18:20:30] (03PS3) 10Krinkle: Remove /w/skin-1.5 symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521562 (https://phabricator.wikimedia.org/T156319) [18:26:53] (03CR) 10Ottomata: [C: 03+2] Disabling temporarily the 3rd party filter for EL events [puppet] - 10https://gerrit.wikimedia.org/r/521541 (https://phabricator.wikimedia.org/T227150) (owner: 10Nuria) [18:27:00] (03PS2) 10Ottomata: Disabling temporarily the 3rd party filter for EL events [puppet] - 10https://gerrit.wikimedia.org/r/521541 (https://phabricator.wikimedia.org/T227150) (owner: 10Nuria) [18:28:03] (03PS2) 10Dzahn: ipmi: add icinga notes_url for IPMI sensor [puppet] - 10https://gerrit.wikimedia.org/r/521401 [18:28:09] (03PS5) 10CRusnov: netbox: Add parameters and settings for storing things in Swift [puppet] - 10https://gerrit.wikimedia.org/r/520296 (https://phabricator.wikimedia.org/T209182) [18:28:39] (03CR) 10CRusnov: "> Patch Set 4:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520296 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [18:29:37] (03CR) 10Dzahn: [C: 03+2] ipmi: add icinga notes_url for IPMI sensor [puppet] - 10https://gerrit.wikimedia.org/r/521401 (owner: 10Dzahn) [18:29:54] (03PS3) 10Dzahn: ipmi: add icinga notes_url for IPMI sensor [puppet] - 10https://gerrit.wikimedia.org/r/521401 [18:34:32] (03PS1) 10Paladox: Update its-base plugin submodule [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/521564 [18:34:55] (03CR) 10Volans: [C: 03+1] "LGTM, but I'd like Filippo to have a final look too for the swift part" [puppet] - 10https://gerrit.wikimedia.org/r/520296 
(https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [18:35:01] (03PS1) 10ArielGlenn: add a few more public sql tables to default list to be dumped [dumps] - 10https://gerrit.wikimedia.org/r/521565 (https://phabricator.wikimedia.org/T226167) [18:36:46] 10Operations, 10ops-eqiad, 10Cassandra, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 4 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Cmjohnson) @eevans We did a test run for an install and the server was able to reach the inst... [18:39:14] (03PS1) 10ArielGlenn: add more public sql tables to xml/sql dumps [puppet] - 10https://gerrit.wikimedia.org/r/521566 (https://phabricator.wikimedia.org/T226167) [18:39:25] (03CR) 10CRusnov: "https://puppet-compiler.wmflabs.org/compiler1002/17281/netmon1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/520296 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [18:41:29] (03CR) 10Thcipriani: [V: 03+2 C: 03+2] "Changes look good to me!" [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/521564 (owner: 10Paladox) [18:48:56] 10Operations, 10Release Pipeline, 10serviceops, 10Core Platform Team (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10Pchelolo) > Regarding the deployment plan, the main pain point is that we will need to... [18:51:56] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:54:15] (03CR) 10Jforrester: [C: 03+1] "What could posibly breal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521562 (https://phabricator.wikimedia.org/T156319) (owner: 10Krinkle) [19:00:04] longma: (Dis)respected human, time to deploy MediaWiki train - American version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190709T1900). Please do the needful. [19:09:02] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [19:12:12] (03PS1) 10Dzahn: ci::master: exclude /mnt/docker from Icinga disk check [puppet] - 10https://gerrit.wikimedia.org/r/521571 (https://phabricator.wikimedia.org/T227605) [19:13:22] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [19:13:36] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [19:13:48] (03CR) 10Dzahn: [C: 03+2] ci::master: exclude /mnt/docker from Icinga disk check [puppet] - 10https://gerrit.wikimedia.org/r/521571 (https://phabricator.wikimedia.org/T227605) (owner: 10Dzahn) [19:13:58] (03PS2) 10Dzahn: ci::master: exclude /mnt/docker from Icinga disk check [puppet] - 10https://gerrit.wikimedia.org/r/521571 (https://phabricator.wikimedia.org/T227605) [19:14:49] !log replace netflow target on cr2-eqiad with netflow1001 [19:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:12] (03PS2) 10Dzahn: gerrit: stop including passwords class in module [puppet] - 10https://gerrit.wikimedia.org/r/511618 [19:17:18] !log enable samping on cr2-eqiad:border-in4 [19:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:55] 10Operations, 10Security-Team: Add Jennifer Cross to security@ alias. - https://phabricator.wikimedia.org/T227609 (10sbassett) [19:25:07] 10Operations, 10Security-Team: Add Jennifer Cross to security@ alias - https://phabricator.wikimedia.org/T227609 (10sbassett) p:05Triage→03Normal [19:25:57] 10Operations, 10ops-eqiad, 10Cassandra, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 4 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Dzahn) restbase1017 is shown as down in Icinga and has no downtime or comment . would be appr... 
[19:26:43] ACKNOWLEDGEMENT - Host restbase1017 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T222960 [19:29:23] (03PS3) 10Dzahn: gerrit: stop including passwords class in module [puppet] - 10https://gerrit.wikimedia.org/r/511618 [19:30:16] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1001/17282/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/511618 (owner: 10Dzahn) [19:34:01] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.34.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521574 [19:34:03] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.34.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521574 (owner: 10Jeena Huneidi) [19:35:43] (03Merged) 10jenkins-bot: testwikis wikis to 1.34.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521574 (owner: 10Jeena Huneidi) [19:36:07] !log jhuneidi@deploy1001 Started scap: testwikis wikis to 1.34.0-wmf.13 [19:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:46] (03CR) 10jenkins-bot: testwikis wikis to 1.34.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521574 (owner: 10Jeena Huneidi) [19:40:34] (03PS4) 10Dzahn: gerrit: stop including passwords class in module [puppet] - 10https://gerrit.wikimedia.org/r/511618 [19:40:54] (03PS1) 10Aklapper: Fix non-working "raw text" links on noc.wikimedia.org web pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521576 (https://phabricator.wikimedia.org/T227606) [19:42:45] (03CR) 10DannyS712: [C: 03+1] "Looks good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521576 (https://phabricator.wikimedia.org/T227606) (owner: 10Aklapper) [19:43:05] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17283/" [puppet] - 10https://gerrit.wikimedia.org/r/511618 (owner: 10Dzahn) [19:59:16] (03CR) 10Cwhite: [V: 03+2 C: 03+2] initial attempt at a varnishkafka exporter 
[debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [20:01:54] 10Operations, 10Patch-For-Review: decom netmon1003 - https://phabricator.wikimedia.org/T220355 (10Dzahn) 05Open→03Stalled stalled. waiting for a confirmation that servermon the service can be retired (T198939#5104657) [20:01:58] 10Operations, 10Patch-For-Review: Decommission servermon - https://phabricator.wikimedia.org/T198939 (10Dzahn) [20:02:28] 10Operations, 10Patch-For-Review: Decommission servermon - https://phabricator.wikimedia.org/T198939 (10Dzahn) gentle ping [20:08:25] PROBLEM - High CPU load on API appserver on mw1282 is CRITICAL: CRITICAL - load average: 63.57, 30.35, 19.60 https://wikitech.wikimedia.org/wiki/Application_servers [20:09:03] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 49.06, 23.75, 15.55 https://wikitech.wikimedia.org/wiki/Application_servers [20:09:33] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 55.93, 25.35, 15.72 https://wikitech.wikimedia.org/wiki/Application_servers [20:09:35] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 49.18, 25.52, 16.25 https://wikitech.wikimedia.org/wiki/Application_servers [20:09:53] RECOVERY - High CPU load on API appserver on mw1282 is OK: OK - load average: 25.99, 26.58, 19.27 https://wikitech.wikimedia.org/wiki/Application_servers [20:10:11] PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 68.88, 28.89, 18.60 https://wikitech.wikimedia.org/wiki/Application_servers [20:10:29] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 25.26, 23.22, 16.15 https://wikitech.wikimedia.org/wiki/Application_servers [20:10:59] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 20.59, 21.57, 15.28 
https://wikitech.wikimedia.org/wiki/Application_servers [20:11:01] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 21.81, 22.52, 16.01 https://wikitech.wikimedia.org/wiki/Application_servers [20:12:46] !log jhuneidi@deploy1001 Finished scap: testwikis wikis to 1.34.0-wmf.13 (duration: 36m 39s) [20:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:55] (03PS2) 10Dzahn: Undeploy pdfrender: remove remaining classes. [puppet] - 10https://gerrit.wikimedia.org/r/521550 (https://phabricator.wikimedia.org/T226675) (owner: 10Ppchelko) [20:13:05] RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 19.57, 26.78, 19.84 https://wikitech.wikimedia.org/wiki/Application_servers [20:14:29] (03PS1) 10Cwhite: set up debian packaging [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/521580 (https://phabricator.wikimedia.org/T196066) [20:20:08] 10Operations, 10Analytics, 10netops, 10LDAP: LDAP ldap-ro.eqiad.wikimedia.org not reachable from Analytics VLAN - https://phabricator.wikimedia.org/T227611 (10Ottomata) [20:20:12] (03PS1) 10Jeena Huneidi: group0 wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521581 [20:20:14] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521581 (owner: 10Jeena Huneidi) [20:21:08] (03Merged) 10jenkins-bot: group0 wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521581 (owner: 10Jeena Huneidi) [20:21:23] (03CR) 10jenkins-bot: group0 wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521581 (owner: 10Jeena Huneidi) [20:23:03] (03CR) 10Dzahn: [C: 03+2] Undeploy pdfrender: remove remaining classes. 
[puppet] - 10https://gerrit.wikimedia.org/r/521550 (https://phabricator.wikimedia.org/T226675) (owner: 10Ppchelko) [20:23:17] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.34.0-wmf.13 refs T220738 [20:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:23] T220738: 1.34.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T220738 [20:25:38] 10Operations, 10Analytics, 10netops, 10LDAP: LDAP ldap-ro.eqiad.wikimedia.org not reachable from Analytics VLAN - https://phabricator.wikimedia.org/T227611 (10Ottomata) p:05Triage→03High [20:25:55] !log temp disabling puppet on scb1001 - removing pdfrender classes from scb2001 [20:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:52] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/521536 (https://phabricator.wikimedia.org/T227587) (owner: 10Jbond) [20:34:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, but probably worth a wide PCC run." 
[puppet] - 10https://gerrit.wikimedia.org/r/521514 (https://phabricator.wikimedia.org/T227587) (owner: 10Jbond) [20:34:32] (03CR) 10Jbond: puppet: update puppet-temini package name on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/521536 (https://phabricator.wikimedia.org/T227587) (owner: 10Jbond) [20:35:08] (03PS1) 10Dzahn: remove pdfrender records [dns] - 10https://gerrit.wikimedia.org/r/521582 (https://phabricator.wikimedia.org/T226675) [20:36:28] !log scb2001 - sudo systemctl stop pdfrender (T226675) [20:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:33] T226675: Undeploy electron service from WMF production - https://phabricator.wikimedia.org/T226675 [20:37:41] !log scb1001 - re-activate puppet, run puppet, stop pdfrender service, run puppet again (T226675) [20:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:10] (03PS1) 10Ppchelko: LVS for RESTRouter. [puppet] - 10https://gerrit.wikimedia.org/r/521584 (https://phabricator.wikimedia.org/T223953) [20:39:52] (03CR) 10Ppchelko: [C: 04-1] "Self -1 until the service is actually deployed and IPs assigned." 
[puppet] - 10https://gerrit.wikimedia.org/r/521584 (https://phabricator.wikimedia.org/T223953) (owner: 10Ppchelko) [20:40:11] (03PS1) 10Andrew Bogott: nova-fullstack: change to use Buster by default [puppet] - 10https://gerrit.wikimedia.org/r/521585 [20:41:18] (03CR) 10Jbond: "> Patch Set 4: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/521514 (https://phabricator.wikimedia.org/T227587) (owner: 10Jbond) [20:41:20] (03CR) 10Muehlenhoff: "Looks good, there's an unrelated issue with the package names in buster; puppet-el now has a different name (but we can also do that as se" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/521536 (https://phabricator.wikimedia.org/T227587) (owner: 10Jbond) [20:42:09] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack: change to use Buster by default [puppet] - 10https://gerrit.wikimedia.org/r/521585 (owner: 10Andrew Bogott) [20:42:35] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/dns/+/521582" [dns] - 10https://gerrit.wikimedia.org/r/521523 (https://phabricator.wikimedia.org/T226675) (owner: 10Alexandros Kosiaris) [20:46:00] longma: yeah, I'd revert, I got [XST83QpAAEoAAF9SB9kAAADM] Caught exception of type Wikimedia\Rdbms\DBConnectionError on mw.o trying to save an edit [20:46:37] (03PS1) 10Muehlenhoff: Switch restbase1017 to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/521586 [20:46:59] greg-g: okay, should I make a phab task as well? [20:47:08] longma: yeah [20:48:43] longma: this is the error from my edit save attempt: https://logstash.wikimedia.org/goto/3e27979b3d79bcff5edfe453f7031ecd [20:48:49] so you can use that in the phab task [20:48:54] thanks [20:51:51] 10Operations, 10Security-Team: Add Jennifer Cross to security@ alias - https://phabricator.wikimedia.org/T227609 (10Dzahn) - confirmed jcross-ctr@ exists in Google - confirmed requestor is linked to WMF account (and in security) - confirmed Jennifer Cross (https://phabricator.wikimedia.org/p/Jcross/) is alrea... 
[20:52:19] 10Operations, 10Security-Team: Add Jennifer Cross to security@ alias - https://phabricator.wikimedia.org/T227609 (10Dzahn) 05Open→03Resolved a:03Dzahn [20:58:14] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.34.0-wmf.11" [20:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:35] 10Operations, 10vm-requests: Site: eqiad/codfw VM for ORES pool counters - https://phabricator.wikimedia.org/T227567 (10MoritzMuehlenhoff) 05Open→03Resolved VMs have been created [21:04:22] (03PS1) 10Urbanecm: Optimalize unoptimalized logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521632 [21:08:25] (03PS1) 10Jeena Huneidi: Revert "group0 wikis to 1.34.0-wmf.13 refs T220738" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521662 [21:09:01] (03CR) 10Volans: [C: 04-1] "One small thing to fix inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520296 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [21:10:48] (03CR) 10CRusnov: "> Patch Set 5: Code-Review-1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520296 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [21:12:25] (03CR) 10Jeena Huneidi: [C: 03+2] Revert "group0 wikis to 1.34.0-wmf.13 refs T220738" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521662 (owner: 10Jeena Huneidi) [21:13:21] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.34.0-wmf.13 refs T220738" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521662 (owner: 10Jeena Huneidi) [21:13:37] (03CR) 10jenkins-bot: Revert "group0 wikis to 1.34.0-wmf.13 refs T220738" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521662 (owner: 10Jeena Huneidi) [21:13:59] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10Dzahn) Next we need to make a decision whether we keep 
phab1003 as the prod host permanently (why not i guess?) then we deco... [21:15:30] (03PS6) 10CRusnov: netbox: Add parameters and settings for storing things in Swift [puppet] - 10https://gerrit.wikimedia.org/r/520296 (https://phabricator.wikimedia.org/T209182) [21:16:16] 10Operations, 10Security-Team: jalexander should be removed from security@ as his emails are bouncing - https://phabricator.wikimedia.org/T212621 (10Dzahn) >>! In T212621#4853213, @chasemp wrote: >>>! In T212621#4853210, @Dzahn wrote: >> Done. removed jalexander from security@ alias. I do wonder where the rest... [21:16:18] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:17:40] (03CR) 10Volans: puppet: update puppet-temini package name on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/521536 (https://phabricator.wikimedia.org/T227587) (owner: 10Jbond) [21:18:39] (03CR) 10CRusnov: "https://puppet-compiler.wmflabs.org/compiler1002/17284/netmon1002.wikimedia.org/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520296 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [21:19:04] (03CR) 10Volans: [C: 03+1] "back to LGTM, but I'd like Filippo to have a final look too for the swift part ;)" [puppet] - 10https://gerrit.wikimedia.org/r/520296 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [21:31:04] 10Operations, 10MediaWiki-extensions-CentralAuth, 10TimedMediaHandler, 10Traffic, and 3 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10Krinkle) [21:33:07] 10Operations, 10Striker, 10LDAP: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048 (10bd808) [21:37:11] 
(03PS17) 10CRusnov: Add LibreNMS parity check report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) [21:40:53] * Krinkle staging on mwdebug1002 [21:53:26] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.13/includes/libs/rdbms/: T226770 / 4c2a58589f2db (duration: 00m 59s) [21:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:32] T226770: 10X increase in DBPerformance warnings on 1.34-wmf.10 - https://phabricator.wikimedia.org/T226770 [21:53:39] (03CR) 10CRusnov: "Latest patchset is Mostly Green." [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: 10CRusnov) [21:58:00] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 50.92, 18.20, 11.57 https://wikitech.wikimedia.org/wiki/Application_servers [21:59:26] RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 17.04, 15.18, 11.08 https://wikitech.wikimedia.org/wiki/Application_servers [22:03:55] 10Operations, 10ops-eqiad, 10Cassandra, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 4 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Eevans) >>! In T222960#5318812, @Dzahn wrote: > restbase1017 is shown as down in Icinga and h... [22:09:41] 10Operations, 10ops-eqiad, 10Cassandra, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 4 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Eevans) >>! In T222960#5318560, @Cmjohnson wrote: > @eevans We did a test run for an install... 
[22:09:49] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.13/extensions/Collection/includes/CollectionProposals.php: T227407 / 69a30966c (duration: 00m 57s) [22:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:55] T227407: PHP Warning: count(): Parameter must be an array or an object that implements Countable - https://phabricator.wikimedia.org/T227407 [22:10:18] PROBLEM - puppet last run on bast5001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [22:13:58] (03PS8) 10CRusnov: backends: add Netbox backend [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) [22:14:25] (03CR) 10CRusnov: "Thanks for the template advice! THat and other issues should be addressed." (035 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov) [22:25:12] (03CR) 10jerkins-bot: [V: 04-1] Add LibreNMS parity check report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: 10CRusnov) [22:26:17] (03PS18) 10CRusnov: Add LibreNMS parity check report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) [22:26:45] (03Abandoned) 10Dzahn: ipmi: add Icinga notes_url [puppet] - 10https://gerrit.wikimedia.org/r/511949 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [22:37:36] RECOVERY - puppet last run on bast5001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [22:48:49] 10Operations, 10ops-ulsfo: ulsfo: setup ulsfo PDUs - https://phabricator.wikimedia.org/T209101 (10RobH) [22:49:06] (03CR) 10jerkins-bot: [V: 04-1] backends: add Netbox backend [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov) [22:50:43] 
(03CR) 10CRusnov: "Suddenly prospector seems to be spitting out a ton of new errors." [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov) [22:51:52] * Krinkle staging on mwdebug1002 [22:52:41] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.13/extensions/SecurePoll/includes/pages/: c7d7a55b8e8d947234a9 / T227620 (duration: 00m 57s) [22:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:46] T227620: PHP Fatal Error from SpecialSecurePoll: Cannot access ResultWrapper::$result - https://phabricator.wikimedia.org/T227620 [22:54:28] 10Operations, 10LDAP, 10Security: Have a check to prevent non-existent accounts from being added to LDAP groups - https://phabricator.wikimedia.org/T201779 (10Legoktm) Bump, this happened again: {T224110}. If someone could document what script is being used to do this, I can look into writing a patch. [22:55:02] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.13/extensions/AbuseFilter/includes/AbuseFilter.php: 0096dff3022 / T227613 (duration: 00m 57s) [22:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:08] T227613: Cannot save on 1.34.0-wmf.13 - "Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T227613 [22:55:46] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:55:56] (03PS1) 10Andrew Bogott: sssd: manage /etc/pam.d/common-session to ensure homedir creation [puppet] - 10https://gerrit.wikimedia.org/r/521793 (https://phabricator.wikimedia.org/T227475) [22:56:43] jouncebot: nex [22:56:49] jouncebot: next [22:56:49] In 0 hour(s) and 3 minute(s): Evening SWAT (Max 6 patches) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190709T2300) [22:56:59] James_F: I'm done [22:57:05] Krinkle: Thanks. [22:57:20] (03PS2) 10Andrew Bogott: sssd: manage /etc/pam.d/common-session to ensure homedir creation [puppet] - 10https://gerrit.wikimedia.org/r/521793 (https://phabricator.wikimedia.org/T227475) [22:57:37] (03PS1) 10Andrew Bogott: Revert "nova-fullstack: change to use Buster by default" [puppet] - 10https://gerrit.wikimedia.org/r/521794 [22:58:20] (03CR) 10Andrew Bogott: [C: 03+2] Revert "nova-fullstack: change to use Buster by default" [puppet] - 10https://gerrit.wikimedia.org/r/521794 (owner: 10Andrew Bogott) [23:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190709T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:01:07] (Train is crashing the empty SWAT.) 
[23:01:29] (03CR) 10Eevans: [C: 03+1] "I can't really speak to the change itself (I guess Stretch is the default, and this removes the exception for Jessie?), but am +1 on upgra" [puppet] - 10https://gerrit.wikimedia.org/r/521586 (owner: 10Muehlenhoff) [23:02:10] I am going to run the train again since there are no patches for SWAT (James_F beat me to the announcement) [23:04:05] (03PS1) 10Jeena Huneidi: group0 wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521796 [23:04:07] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521796 (owner: 10Jeena Huneidi) [23:05:11] (03Merged) 10jenkins-bot: group0 wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521796 (owner: 10Jeena Huneidi) [23:06:37] (03CR) 10jenkins-bot: group0 wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521796 (owner: 10Jeena Huneidi) [23:06:49] !log updating power ports on T209101 and disabling ports not in use (only turning off one side and awaiting any icinga alerts for 15 minutes before touching other side of power) [23:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:54] T209101: ulsfo: setup ulsfo PDUs - https://phabricator.wikimedia.org/T209101 [23:07:17] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.34.0-wmf.13 refs T220738 [23:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:22] T220738: 1.34.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T220738 [23:18:06] longma: Seems good. You should send an update to wikitech-l saying it's back on track, and thanking Aaron and Timo. :-) [23:18:29] I was about to thank them in the phab task but email is good too [23:18:55] Thanks for the help! 
[23:19:50] longma: there were to other regressions I spotted in Logstash as well, but they've been fixed meanwhile by Aaron also, and deployed just now. [23:19:54] two* [23:20:05] so all clear indeed, so far :) [23:20:12] oh great [23:23:08] 10Operations, 10ops-ulsfo: ulsfo: setup ulsfo PDUs - https://phabricator.wikimedia.org/T209101 (10RobH) [23:23:42] 10Operations, 10ops-ulsfo: ulsfo: setup ulsfo PDUs - https://phabricator.wikimedia.org/T209101 (10RobH) 05Open→03Resolved imported all of the power connections into netbox, and the pdu towers have their ports labeled on the PDU software as well, with groups added for outlet control on network devices. [23:25:29] (03CR) 10Dzahn: "how are we going to make LDAP edits now?" [puppet] - 10https://gerrit.wikimedia.org/r/521518 (https://phabricator.wikimedia.org/T46722) (owner: 10Andrew Bogott) [23:32:27] (03PS9) 10CRusnov: backends: add Netbox backend [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) [23:33:16] (03CR) 10CRusnov: "> Patch Set 8:" [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov) [23:33:44] (03PS3) 10Andrew Bogott: sssd: manage /etc/pam.d/common-session to ensure homedir creation [puppet] - 10https://gerrit.wikimedia.org/r/521793 (https://phabricator.wikimedia.org/T227475) [23:33:46] (03PS1) 10Andrew Bogott: ldap: use a read-write ldap host in hieradata/eqiad.yaml [puppet] - 10https://gerrit.wikimedia.org/r/521798 [23:36:51] (03PS2) 10Andrew Bogott: ldap: use a read-write ldap host in hieradata/eqiad.yaml [puppet] - 10https://gerrit.wikimedia.org/r/521798 [23:37:46] (03CR) 10Andrew Bogott: [C: 03+2] ldap: use a read-write ldap host in hieradata/eqiad.yaml [puppet] - 10https://gerrit.wikimedia.org/r/521798 (owner: 10Andrew Bogott) [23:57:31] 10Operations, 10LDAP, 10Security: Have a check to prevent non-existent accounts from being added to LDAP groups - 
https://phabricator.wikimedia.org/T201779 (10Dzahn) >>! In T201779#5319806, @Legoktm wrote: > Bump, this happened again: {T224110}. > > If someone could document what script is being used to do... [23:58:21] 10Operations, 10ops-codfw, 10ops-eqiad: Document PDU models - https://phabricator.wikimedia.org/T227632 (10ayounsi) p:05Triage→03Low [23:59:27] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for Mayakpwiki - https://phabricator.wikimedia.org/T227633 (10Mayakp.wiki)