[00:11:25] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 48 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[00:16:53] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[00:30:23] (PS1) Bstorm: haproxy: make monitoring code optional [puppet] - https://gerrit.wikimedia.org/r/519159 (https://phabricator.wikimedia.org/T215531)
[00:35:33] Operations, netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (faidon) Thanks @JobSnijders, appreciate the feedback very much :) Our goal is to reject all invalids everywhere indeed, just progressively so. Separate validator instances per PoP would be ideal I think, but more so for red...
[00:42:04] (PS1) Bstorm: toolforge: correct a bunch of the apilb profile [puppet] - https://gerrit.wikimedia.org/r/519160 (https://phabricator.wikimedia.org/T215531)
[01:25:36] (PS1) DannyS712: Allow bureaucrats to remove sysop and bureaucrat on nycwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/519163 (https://phabricator.wikimedia.org/T226591)
[01:28:15] (PS2) DannyS712: Allow bureaucrats to remove sysop and bureaucrat on nycwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/519163 (https://phabricator.wikimedia.org/T226591)
[01:29:11] (PS3) DannyS712: Allow bureaucrats to remove sysop on nycwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/519163 (https://phabricator.wikimedia.org/T226591)
[03:04:55] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[03:06:53] PROBLEM - Check systemd state on ms-be1029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:31:16] (PS1) Bmansurov: Enable reader demographics surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/519167 (https://phabricator.wikimedia.org/T226273)
[03:32:07] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[03:59:13] RECOVERY - Check systemd state on ms-be1029 is OK: OK - running: The system is fully operational
[04:52:09] (PS1) Marostegui: db-eqiad,db-codfw.php: Provision db1133 in m5 [mediawiki-config] - https://gerrit.wikimedia.org/r/519171 (https://phabricator.wikimedia.org/T222682)
[04:54:16] (PS1) Marostegui: db1133: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/519172 (https://phabricator.wikimedia.org/T222682)
[04:55:02] (CR) Marostegui: [C: +2] db1133: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/519172 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:08:03] Operations, ops-eqiad: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T226599 (ops-monitoring-bot)
[05:25:21] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Provision db1133 in m5 [mediawiki-config] - https://gerrit.wikimedia.org/r/519171 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:26:14] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Provision db1133 in m5 [mediawiki-config] - https://gerrit.wikimedia.org/r/519171 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:26:33] (CR) jenkins-bot: db-eqiad,db-codfw.php: Provision db1133 in m5 [mediawiki-config] - https://gerrit.wikimedia.org/r/519171 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:28:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Add db1133 into m5 depooled T222682 (duration: 00m 55s)
[05:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:26] T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682
[05:29:21] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Add db1133 into m5 depooled T222682 (duration: 00m 55s)
[05:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:52] Operations, ops-eqiad, DBA, Goal, User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (Marostegui)
[05:34:27] Operations, Traffic: Replace Varnish backends with ATS on cache upload nodes - https://phabricator.wikimedia.org/T226589 (ema) p:Triage→Normal
[05:38:11] Operations, Cognate, ContentTranslation, DBA, and 9 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Marostegui)
[05:45:10] PROBLEM - Check systemd state on ms-be2032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:45:22] PROBLEM - Disk space on ms-be2032 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdb3 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[05:45:34] PROBLEM - MD RAID on ms-be2032 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[05:45:35] ACKNOWLEDGEMENT - MD RAID on ms-be2032 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T226600 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[05:45:39] Operations, ops-codfw: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T226600 (ops-monitoring-bot)
[05:45:54] PROBLEM - very high load average likely xfs on ms-be2032 is CRITICAL: CRITICAL - load average: 225.74, 246.46, 162.45 https://wikitech.wikimedia.org/wiki/Swift
[05:45:56] PROBLEM - swift-container-updater on ms-be2032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift
[05:46:50] !log wikimedia_editor_tasks_entity_description_exists from s3:testwikidatawiki T226326
[05:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:55] T226326: Drop the `wikimedia_editor_tasks_entity_description_exists` table - https://phabricator.wikimedia.org/T226326
[05:49:11] Operations, Developer-Advocacy, Gerrit, serviceops: Remove port 29418 from cloning process - https://phabricator.wikimedia.org/T37611 (jijiki) p:Normal→Low
[05:57:56] !log wikimedia_editor_tasks_entity_description_exists from s8:testwikidatawiki T226326
[05:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:01] T226326: Drop the `wikimedia_editor_tasks_entity_description_exists` table - https://phabricator.wikimedia.org/T226326
[05:59:58] !log systemctl mask + reset-failed kafka on kafka10[12-23] - T226517
[06:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:03] T226517: Reclaim/Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517
[06:01:30] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:01:44] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational
[06:04:00] Operations, Performance-Team, Traffic, Performance: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 (ema) >>! In T225998#5284077, @Gilles wrote: > Remember that x-cache headers are read from right to left. More details here: https:/...
[06:09:46] RECOVERY - very high load average likely xfs on ms-be2032 is OK: OK - load average: 1.45, 6.42, 45.27 https://wikitech.wikimedia.org/wiki/Swift
[06:17:55] Operations, ops-eqiad, Analytics, DC-Ops, decommission: Reclaim/Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (elukey) a:RobH
[06:18:36] Operations, ops-eqiad, Analytics, DC-Ops, decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (elukey)
[06:24:41] Operations, DBA, MediaWiki-Database, Patch-For-Review: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (Marostegui) @jcrespo you ok if I copy dewiki.logging into db1114? I would like to see the...
[06:24:47] (PS2) Alexandros Kosiaris: ganeti: Setup buster and a software RAID5 recipe [puppet] - https://gerrit.wikimedia.org/r/519075 (https://phabricator.wikimedia.org/T224603)
[06:29:32] (PS2) Ema: varnishfetcherror: log BogoHeader [puppet] - https://gerrit.wikimedia.org/r/519056 (https://phabricator.wikimedia.org/T226375)
[06:30:07] (CR) Ema: [C: +2] varnishfetcherror: log BogoHeader [puppet] - https://gerrit.wikimedia.org/r/519056 (https://phabricator.wikimedia.org/T226375) (owner: Ema)
[06:30:34] (CR) Urbanecm: [C: +1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/519163 (https://phabricator.wikimedia.org/T226591) (owner: DannyS712)
[06:33:08] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/sites-available/50-prometheus.conf]
[06:33:38] PROBLEM - puppet last run on mc1035 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:37:37] (CR) Muehlenhoff: [C: +1] ganeti: Setup buster and a software RAID5 recipe [puppet] - https://gerrit.wikimedia.org/r/519075 (https://phabricator.wikimedia.org/T224603) (owner: Alexandros Kosiaris)
[06:37:41] (CR) Tim Starling: Add a fatal error page to go with the proposed wmerrors feature (1 comment) [puppet] - https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) (owner: Tim Starling)
[06:40:44] (PS4) Tim Starling: Add a fatal error page to go with the proposed wmerrors feature [puppet] - https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147)
[06:41:14] (CR) jerkins-bot: [V: -1] Add a fatal error page to go with the proposed wmerrors feature [puppet] - https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) (owner: Tim Starling)
[06:45:37] Operations, media-storage, serviceops, Patch-For-Review: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (jijiki)
[06:48:16] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational
[06:50:41] Operations, DBA, MediaWiki-Database, Patch-For-Review: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (jcrespo) > @jcrespo you ok if I copy dewiki.logging into db1114 Sure, if you do it in it...
[06:52:05] Operations, DBA, MediaWiki-Database, Patch-For-Review: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (Marostegui) >>! In T193224#5285040, @jcrespo wrote: >> @jcrespo you ok if I copy dewiki.l...
[06:54:52] (PS1) Urbanecm: Tidy up groupOverrides [mediawiki-config] - https://gerrit.wikimedia.org/r/519180 (https://phabricator.wikimedia.org/T185898)
[07:00:18] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[07:00:50] RECOVERY - puppet last run on mc1035 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:04:18] (PS1) Elukey: role::druid::analytics|public::worker: set stricter query timeouts [puppet] - https://gerrit.wikimedia.org/r/519181 (https://phabricator.wikimedia.org/T226035)
[07:06:49] (CR) Joal: [C: +1] "LGTM ! Thanks" [puppet] - https://gerrit.wikimedia.org/r/519181 (https://phabricator.wikimedia.org/T226035) (owner: Elukey)
[07:06:59] (CR) Elukey: [C: +2] role::druid::analytics|public::worker: set stricter query timeouts [puppet] - https://gerrit.wikimedia.org/r/519181 (https://phabricator.wikimedia.org/T226035) (owner: Elukey)
[07:07:23] Operations, DBA, MediaWiki-Database, Patch-For-Review: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (jcrespo) Ping @Anomie We have temporarily setup db1114 with MariaDB 10.3 and load it with...
[07:09:33] !log reboot of druid100[1-3] hosts for kernel + openjdk upgrades
[07:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:13] Operations, Diffusion, Packaging, Release-Engineering-Team (Kanban), and 2 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (MoritzMuehlenhoff) @mmodell Ideally we fix this in Debian so that others can also bene...
[07:16:05] (PS1) Ema: cache: reimage cp5003 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519183 (https://phabricator.wikimedia.org/T226477)
[07:18:27] Operations, Wikimedia Australia, Wikimedia-Mailing-lists: Wikimedia-au-members and wikimedia-au-announce password reset - https://phabricator.wikimedia.org/T225712 (Quiddity) Open→Resolved a:Quiddity This is now resolved.
[07:18:50] (Abandoned) Filippo Giunchedi: icinga: increase service_check / command_timeout by 11% [puppet] - https://gerrit.wikimedia.org/r/516627 (https://phabricator.wikimedia.org/T210723) (owner: Filippo Giunchedi)
[07:19:21] !log depool cp5003 and reimage as upload_ats T226477
[07:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:26] T226477: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477
[07:20:08] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/518954 (https://phabricator.wikimedia.org/T212259) (owner: Elukey)
[07:20:31] (CR) Ema: [C: +2] cache: reimage cp5003 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519183 (https://phabricator.wikimedia.org/T226477) (owner: Ema)
[07:23:35] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5003.eqsin.wmnet'] ` The log can be found in `...
[07:30:15] (PS1) Marostegui: mariadb: Promote db1120 to x1 master [puppet] - https://gerrit.wikimedia.org/r/519185 (https://phabricator.wikimedia.org/T226358)
[07:30:37] !log powercycle ms-be2032 - T226600
[07:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:43] T226600: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T226600
[07:30:50] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:31:37] Operations, ops-codfw: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T226600 (fgiunchedi) Unaccessible via ssh ` $ ssh ms-be2032.codfw.wmnet Linux ms-be2032 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u2 (2019-05-13) x86_64 Debian GNU/Linux 9.9 (stretch) ms-be2032 is a statsite server (st...
[07:33:02] (CR) Marostegui: "Puppet compiler looks good: https://puppet-compiler.wmflabs.org/compiler1002/17110/" [puppet] - https://gerrit.wikimedia.org/r/519185 (https://phabricator.wikimedia.org/T226358) (owner: Marostegui)
[07:33:08] (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/519185 (https://phabricator.wikimedia.org/T226358) (owner: Marostegui)
[07:34:04] RECOVERY - swift-container-updater on ms-be2032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift
[07:34:04] RECOVERY - MD RAID on ms-be2032 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[07:34:13] (PS1) Marostegui: wmnet: Change x1-master to the new master [dns] - https://gerrit.wikimedia.org/r/519186 (https://phabricator.wikimedia.org/T226358)
[07:34:14] RECOVERY - Check systemd state on ms-be2032 is OK: OK - running: The system is fully operational
[07:34:34] RECOVERY - Disk space on ms-be2032 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[07:34:44] (PS2) Marostegui: wmnet: Change x1-master to point to the new master [dns] - https://gerrit.wikimedia.org/r/519186 (https://phabricator.wikimedia.org/T226358)
[07:35:38] (CR) Marostegui: [C: -2] "Wait for the failover day" [dns] - https://gerrit.wikimedia.org/r/519186 (https://phabricator.wikimedia.org/T226358) (owner: Marostegui)
[07:36:54] (PS1) Marostegui: db-eqiad.php: Promote db1120 to x1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/519187 (https://phabricator.wikimedia.org/T226358)
[07:38:06] Operations, ops-codfw, media-storage: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 (fgiunchedi)
[07:40:05] (CR) Marostegui: [C: -2] "Wait for the failover day" [mediawiki-config] - https://gerrit.wikimedia.org/r/519187 (https://phabricator.wikimedia.org/T226358) (owner: Marostegui)
[07:40:40] (PS1) Giuseppe Lavagetto: www.wikimedia.org: fix Location directives [puppet] - https://gerrit.wikimedia.org/r/519188 (https://phabricator.wikimedia.org/T223835)
[07:41:04] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational
[07:42:07] Operations, ops-codfw: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T226600 (fgiunchedi) The host came back clean after a reboot, I've updated the firmware (cfr T141756) to 6.88 and rebooted again.
[07:42:38] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[07:44:42] (PS1) Ema: cache: add cp3043 back to the text cluster [puppet] - https://gerrit.wikimedia.org/r/519189 (https://phabricator.wikimedia.org/T226375)
[07:47:09] Operations, ops-codfw: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T226600 (fgiunchedi) Open→Resolved a:fgiunchedi
[07:50:53] !log bounce rsyslog on lithium - T199406
[07:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:59] T199406: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406
[08:01:53] (CR) Ema: [C: +1] www.wikimedia.org: fix Location directives [puppet] - https://gerrit.wikimedia.org/r/519188 (https://phabricator.wikimedia.org/T223835) (owner: Giuseppe Lavagetto)
[08:03:41] Operations, Continuous-Integration-Infrastructure, serviceops, Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (hashar) I had a quick chat this morning with various SRE people. Theoretically disk setup is...
[08:08:15] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[08:17:11] Operations, Traffic: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5003.eqsin.wmnet'] ` and were **ALL** successful.
[08:20:14] Operations, SRE-Access-Requests: Request access to analytics cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223697 (alaa_wmde) Hi there, so apparently I am not in `nda` group yet (as seen here https://tools.wmflabs.org/wmde-access/) for some reason. I signed the NDA on Feb 1 2019, 2:36 PM (...
[08:27:14] (PS1) Muehlenhoff: Re-enable TCP selective acknowledgements on hosts running a fixed kernel [puppet] - https://gerrit.wikimedia.org/r/519193 (https://phabricator.wikimedia.org/T225998)
[08:27:32] (CR) Giuseppe Lavagetto: [C: +2] www.wikimedia.org: fix Location directives [puppet] - https://gerrit.wikimedia.org/r/519188 (https://phabricator.wikimedia.org/T223835) (owner: Giuseppe Lavagetto)
[08:30:58] !log pool cp5003 w/ ATS backend T226477
[08:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:04] T226477: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477
[08:43:54] !log rebooting deployment-mediawiki-07 for new kernel
[08:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:46] moritzm: Can I deploy cxserver?
[08:50:44] OK. Seems OK then :)
[08:52:17] on deployment-prep? sure go ahead
[08:52:38] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging]
[08:52:39] !log kartik@deploy1001 scap-helm cxserver cluster staging completed
[08:52:39] !log kartik@deploy1001 scap-helm cxserver finished
[08:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:51] (PS2) Muehlenhoff: Re-enable TCP selective acknowledgements on hosts running a fixed kernel [puppet] - https://gerrit.wikimedia.org/r/519193 (https://phabricator.wikimedia.org/T225998)
[08:56:40] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-eqiad-values.yaml production stable/cxserver [namespace: cxserver, clusters: eqiad]
[08:56:41] !log kartik@deploy1001 scap-helm cxserver cluster eqiad completed
[08:56:41] !log kartik@deploy1001 scap-helm cxserver finished
[08:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:06] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw]
[08:58:07] !log kartik@deploy1001 scap-helm cxserver cluster codfw completed
[08:58:07] !log kartik@deploy1001 scap-helm cxserver finished
[08:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:53] !log Updated cxserver to 9bad239 (T226482)
[09:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:00] T226482: Enable machine translation option for Italian in Content translation - https://phabricator.wikimedia.org/T226482
[09:04:49] !log reboot druid100[4-6] for kernel and openjdk upgrades
[09:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:49] (CR) Ema: [C: +1] "LGTM and to pcc, although it doesn't seem like because /etc/sysctl.d/70-disable_tcp_sack.conf isn't shown here https://puppet-compiler.wmf" [puppet] - https://gerrit.wikimedia.org/r/519193 (https://phabricator.wikimedia.org/T225998) (owner: Muehlenhoff)
[09:16:17] (PS1) Ema: cache: reimage cp5004 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519198 (https://phabricator.wikimedia.org/T226477)
[09:16:42] (CR) Muehlenhoff: [C: +1] cache: reimage cp5004 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519198 (https://phabricator.wikimedia.org/T226477) (owner: Ema)
[09:17:41] (PS1) Elukey: profile::graphite::alerts: remove unused code [puppet] - https://gerrit.wikimedia.org/r/519199 (https://phabricator.wikimedia.org/T226517)
[09:18:49] !log depool cp5004 and reimage as upload_ats T226477
[09:18:54] (CR) Elukey: [C: +2] profile::graphite::alerts: remove unused code [puppet] - https://gerrit.wikimedia.org/r/519199 (https://phabricator.wikimedia.org/T226517) (owner: Elukey)
[09:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:55] T226477: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477
[09:19:15] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db1068 [mediawiki-config] - https://gerrit.wikimedia.org/r/519200 (https://phabricator.wikimedia.org/T217396)
[09:19:50] (CR) Ema: [C: +2] cache: reimage cp5004 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519198 (https://phabricator.wikimedia.org/T226477) (owner: Ema)
[09:19:58] (PS2) Ema: cache: reimage cp5004 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519198 (https://phabricator.wikimedia.org/T226477)
[09:20:24] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db1068 [mediawiki-config] - https://gerrit.wikimedia.org/r/519200 (https://phabricator.wikimedia.org/T217396) (owner: Marostegui)
[09:21:16] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db1068 [mediawiki-config] - https://gerrit.wikimedia.org/r/519200 (https://phabricator.wikimedia.org/T217396) (owner: Marostegui)
[09:21:30] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db1068 [mediawiki-config] - https://gerrit.wikimedia.org/r/519200 (https://phabricator.wikimedia.org/T217396) (owner: Marostegui)
[09:22:09] RECOVERY - puppet last run on graphite1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:22:46] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db1068 from config T217396 (duration: 01m 11s)
[09:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:51] T217396: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396
[09:23:44] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5004.eqsin.wmnet'] ` The log can be found in `...
[09:23:47] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db1068 from config T217396 (duration: 00m 55s)
[09:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:49] (PS7) Filippo Giunchedi: dsa-check-hpssacli: make compatible with ssacli [puppet] - https://gerrit.wikimedia.org/r/516726 (https://phabricator.wikimedia.org/T220787) (owner: Faidon Liambotis)
[09:26:57] (CR) Filippo Giunchedi: [C: +2] dsa-check-hpssacli: make compatible with ssacli [puppet] - https://gerrit.wikimedia.org/r/516726 (https://phabricator.wikimedia.org/T220787) (owner: Faidon Liambotis)
[09:28:46] (PS1) Mathew.onipe: icinga: fix zero division error for mjolnir bulk update alert [puppet] - https://gerrit.wikimedia.org/r/519201 (https://phabricator.wikimedia.org/T225904)
[09:30:07] Operations: HP Gen9 onboard controller review - https://phabricator.wikimedia.org/T216175 (fgiunchedi)
[09:30:11] Operations, Icinga, observability, Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (fgiunchedi) Open→Resolved a:fgiunchedi AFAICT this is good to resolve, please feel free to reopen if that...
[09:37:05] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:37:10] Operations, serviceops, Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (hashar) Adding @MoritzMuehlenhoff since he seems to knows best about the `reprepro` config in `modules/aptrepo/files/upd...
[09:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:23] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[09:42:21] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[09:49:58] <_joe_> !log restarted php7.2-fpm on mwdebug1002, testing php-check-and-restart script
[09:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:42] (CR) Filippo Giunchedi: "See inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/519201 (https://phabricator.wikimedia.org/T225904) (owner: Mathew.onipe)
[10:09:34] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/519193 (https://phabricator.wikimedia.org/T225998) (owner: Muehlenhoff)
[10:11:41] Operations, DBA, observability, Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (jcrespo) Stalled→Open a:jcrespo
[10:12:33] Operations, observability, Goal, User-fgiunchedi: TEC6: Metrics monitoring infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220104 (fgiunchedi)
[10:12:35] Operations, observability, Goal, User-fgiunchedi: Investigate distributed and long term storage solutions for Prometheus - https://phabricator.wikimedia.org/T213918 (fgiunchedi) Open→Resolved a:fgiunchedi The document outlining available options with pros/cons and recommendations is h...
[10:12:50] Operations, DNS, Matrix, Traffic, and 3 others: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (Joe) a:Joe
[10:15:16] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[10:15:55] Operations, Traffic: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5004.eqsin.wmnet'] ` and were **ALL** successful.
[10:19:05] Operations, DNS, Matrix, Traffic, and 3 others: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (Joe) Open→Resolved Using curl I can confirm the header is now added. I fear you might need to force-reload in your browser as I se...
[10:21:47] !log pool cp5004 w/ ATS backend T226477
[10:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:53] T226477: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477
[10:24:07] (PS6) Alaa Sarhan: Switch property terms migration to WRITE_BOTH on wikidata production [mediawiki-config] - https://gerrit.wikimedia.org/r/517674 (https://phabricator.wikimedia.org/T225051)
[10:29:58] (PS1) Jcrespo: prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896)
[10:30:51] (CR) jerkins-bot: [V: -1] prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: Jcrespo)
[10:33:19] (PS2) Jcrespo: prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896)
[10:34:08] (CR) jerkins-bot: [V: -1] prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: Jcrespo)
[10:34:34] Operations, Diffusion, Packaging, Release-Engineering-Team (Kanban), and 2 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (mmodell)
[10:35:02] Operations, Diffusion, Packaging, Release-Engineering-Team (Kanban), and 2 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (mmodell) Thanks @MoritzMuehlenhoff!
[10:37:14] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:37:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:30] 10Operations, 10observability, 10Goal, 10User-fgiunchedi: Investigate distributed and long term storage solutions for Prometheus - https://phabricator.wikimedia.org/T213918 (10CDanis) +1 for the relative simplicity of Thanos (from both a design and deployment perspective) [10:39:42] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:39:44] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:16] (03CR) 10MarcoAurelio: [C: 03+1] "Patch will allow nycwm 'crats to remove sysop rights locally as requested. Bureaucrat removal will need to go through Meta." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/519163 (https://phabricator.wikimedia.org/T226591) (owner: 10DannyS712) [10:51:46] 10Operations, 10ops-eqiad: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T226599 (10jbond) p:05Triage→03Normal [10:52:24] (03CR) 10Urbanecm: [C: 03+1] "> Patch Set 3: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519163 (https://phabricator.wikimedia.org/T226591) (owner: 10DannyS712) [10:52:37] jouncebot, next [10:52:37] In 0 hour(s) and 7 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190626T1100) [10:58:36] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Ladsgroup) Regarding Cognate going read-only, I want to point out to T187960#4998807 (I can run the maintenance script after it's do... [11:00:04] Amir1, Lucas_WMDE, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190626T1100). [11:00:04] alaa_wmde, bmansurov, dcausse, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:09] here [11:00:12] o/ [11:00:12] Hi everyone [11:00:16] o/ [11:00:29] I have one small patch to deploy, and since I'll have to leave in 30 mins, can I please deploy my patch first? [11:00:43] I don't mind. [11:00:49] no objections [11:00:52] okay, thanks! 
[11:01:03] (03PS4) 10Urbanecm: Allow bureaucrats to remove sysop on nycwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519163 (https://phabricator.wikimedia.org/T226591) (owner: 10DannyS712) [11:01:09] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519163 (https://phabricator.wikimedia.org/T226591) (owner: 10DannyS712) [11:02:08] (03Merged) 10jenkins-bot: Allow bureaucrats to remove sysop on nycwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519163 (https://phabricator.wikimedia.org/T226591) (owner: 10DannyS712) [11:02:25] (03CR) 10jenkins-bot: Allow bureaucrats to remove sysop on nycwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519163 (https://phabricator.wikimedia.org/T226591) (owner: 10DannyS712) [11:03:44] (03PS7) 10Urbanecm: Switch property terms migration to WRITE_BOTH on wikidata production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517674 (https://phabricator.wikimedia.org/T225051) (owner: 10Alaa Sarhan) [11:03:51] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517674 (https://phabricator.wikimedia.org/T225051) (owner: 10Alaa Sarhan) [11:03:56] (03PS1) 10Giuseppe Lavagetto: role::mediawiki::jobrunner: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/519205 [11:03:58] (03PS1) 10Giuseppe Lavagetto: profile: introduce lvs_poool_nodes [puppet] - 10https://gerrit.wikimedia.org/r/519206 [11:03:59] alaa_wmde, your patch is next, please stand by [11:04:00] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: add daemon restart cronjob [puppet] - 10https://gerrit.wikimedia.org/r/519207 (https://phabricator.wikimedia.org/T224857) [11:04:02] (03PS1) 10Giuseppe Lavagetto: mediawiki: run the cron for php restarts everywhere [puppet] - 10https://gerrit.wikimedia.org/r/519208 (https://phabricator.wikimedia.org/T224857) [11:04:18] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: 
Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10fgiunchedi) 05Open→03Resolved All related patches are deployed and indeed we're not experiencing timeouts /... [11:04:28] Amir1: Lucas_WMDE can you please take over testing my patch .. as I don't have production access to do full testing? [11:04:50] (03Merged) 10jenkins-bot: Switch property terms migration to WRITE_BOTH on wikidata production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517674 (https://phabricator.wikimedia.org/T225051) (owner: 10Alaa Sarhan) [11:04:53] (03CR) 10jerkins-bot: [V: 04-1] profile: introduce lvs_poool_nodes [puppet] - 10https://gerrit.wikimedia.org/r/519206 (owner: 10Giuseppe Lavagetto) [11:04:56] (03CR) 10MarcoAurelio: "recheck" [debs/file-read-backwards] - 10https://gerrit.wikimedia.org/r/519020 (owner: 10MarcoAurelio) [11:05:04] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::php: add daemon restart cronjob [puppet] - 10https://gerrit.wikimedia.org/r/519207 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto) [11:05:05] alaa_wmde: AFAIK, it's not testable [11:05:14] We just need to stare at logs [11:05:15] why not? [11:05:15] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:519163|Allow bureaucrats to remove sysop on nycwikimedia]] (T226591) (duration: 00m 57s) [11:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:21] T226591: nyc.wikimedia.org - enable removal of sysop access - https://phabricator.wikimedia.org/T226591 [11:05:34] can't we try on sandbox property? it is linked in the task iirc [11:05:35] oh, no, you're right. 
I mistook it with my entity batch size change [11:05:49] cool np [11:05:59] I can do it I guess [11:06:03] alaa_wmde, Amir1: It's on mwdebug1002, please test and let me know [11:06:05] (03PS1) 10MarcoAurelio: DNM JENKINS TEST [debs/file-read-backwards] - 10https://gerrit.wikimedia.org/r/519209 [11:06:13] thanks Urbanecm [11:06:19] yw alaa_wmde [11:06:33] Testing [11:06:45] 1:05 PM I can do it I guess [11:06:45] thanks let's test and get it out finally !!! [11:07:15] (03PS2) 10Urbanecm: Enable reader demographics surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519167 (https://phabricator.wikimedia.org/T226273) (owner: 10Bmansurov) [11:07:24] bmansurov, your patch is next, please stand by [11:07:30] ok, I'm here [11:07:40] I’m here now, sorry for the delay [11:07:48] (03CR) 10jenkins-bot: Switch property terms migration to WRITE_BOTH on wikidata production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517674 (https://phabricator.wikimedia.org/T225051) (owner: 10Alaa Sarhan) [11:08:08] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519167 (https://phabricator.wikimedia.org/T226273) (owner: 10Bmansurov) [11:09:06] (03Merged) 10jenkins-bot: Enable reader demographics surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519167 (https://phabricator.wikimedia.org/T226273) (owner: 10Bmansurov) [11:09:34] I'm on it [11:09:37] Lucas_WMDE: no worries Amir1 is testing my patch already [11:09:40] (03CR) 10jenkins-bot: Enable reader demographics surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519167 (https://phabricator.wikimedia.org/T226273) (owner: 10Bmansurov) [11:09:43] ok [11:09:55] bmansurov, your patch is on mwdebug1002 as well, please test and let me know if I can deploy it [11:10:12] (03PS1) 10Alaa Sarhan: Switch Property Terms migration to WRITE_NEW on test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519211 (https://phabricator.wikimedia.org/T225053) [11:10:24] 
Urbanecm: ok, testing, it will take a couple of minutes. [11:10:32] bmansurov, sure, take your time. [11:10:48] Urbanecm: please check mwdebug logs and if there's no error, proceed [11:11:12] https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002 [11:11:17] (03CR) 10Filippo Giunchedi: "LGTM overall, haven't tried building the package yet though." (034 comments) [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/519068 (owner: 10Cwhite) [11:11:45] Amir1, don't see anything, deploying [11:11:51] yay [11:11:54] alaa_wmde: Lucas_WMDE ^ [11:12:25] 🎉 awesome! [11:12:31] did you test if anything gets written to the new table when a sandbox property is edited? [11:13:00] bmansurov: might be related to your patch: Another module has already been registered as ext.quicksurveys.survey.reader-demographics-en [11:13:09] looks like there are some rows, yes [11:13:14] seen in logstash [11:13:24] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:517674|Switch property terms migration to WRITE_BOTH on wikidata production]] (T225051) (duration: 00m 56s) [11:13:28] dcausse: hmm, ok i'll look into it [11:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:30] T225051: Switch `tmpPropertyTermsMigrationStage` to MIGRATION_WRITE_BOTH - https://phabricator.wikimedia.org/T225051 [11:13:41] alaa_wmde, deployed! [11:13:53] (03PS1) 10Alaa Sarhan: Switch property terms migration to WRITE_NEW on production wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519212 (https://phabricator.wikimedia.org/T225053) [11:13:55] great thank you! [11:14:11] dcausse, thanks, as soon as bmansurov completes testing and his patch will be deployed, SWAT will be yours! 
[11:14:14] (03PS1) 10Volans: dbconfig: honor datacenter scope in config restore [software/conftool] - 10https://gerrit.wikimedia.org/r/519213 [11:14:16] (03PS1) 10Volans: dbconfig: do not commit config if no diff [software/conftool] - 10https://gerrit.wikimedia.org/r/519214 [11:14:28] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, not tested but trusting it DTRT for upstream and thus for us" [puppet] - 10https://gerrit.wikimedia.org/r/517875 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [11:14:29] alaa_wmde, yw [11:14:29] 10Operations, 10ops-eqiad: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T226599 (10jbond) sudo megacli -LDInfo -Lall -aALL ` Virtual Drive: 4 (Target Id: 4) Name : RAID Level : Primary-0, Secondary-0, RAID Level Qualifier-0 Size : 3.637 TB Sector Siz... [11:15:59] Lucas_WMDE: yes I did, at first I tried it on the snadbox item and the tables were empty (spot the idiot) [11:16:03] Urbanecm: can you abort the deploy? I have to fix the issue reported by dcausse [11:16:09] ^^ [11:16:11] bmansurov, sure, rollbacking [11:16:24] Urbanecm: thanks, I'll stand in line after everyone is done [11:16:57] bmansurov: if the fix is easy please go ahead [11:17:06] dcausse: yeah it's easy [11:17:19] I can wait no problem [11:17:42] !log urbanecm@deploy1001 sync-file aborted: Reverting [[:gerrit:519167]] (T226273) (duration: 00m 32s) [11:17:44] dcausse: should I submit a new patch because the other one has been merged? 
[11:17:44] okay, aborting the revert :-D [11:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:48] T226273: Demographic Surveys Configurations - https://phabricator.wikimedia.org/T226273 [11:17:50] bmansurov, you must [11:17:56] oh, just the fix [11:17:56] there's no way how to edit a merged patch [11:17:56] sure [11:18:07] 10Operations, 10ops-eqiad: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T226599 (10Volans) @jbond FYI if you want to mimic the automation, just run: ` $ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli === RaidStatus (does not include components in optimal state) name: Adapter... [11:18:26] dcausse, can you please take SWAT over? [11:18:56] Urbanecm: sure [11:19:04] thanks [11:19:07] (03PS2) 10Alaa Sarhan: Switch Property Terms migration to WRITE_NEW on test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519211 (https://phabricator.wikimedia.org/T225053) [11:19:16] (03PS2) 10Alaa Sarhan: Switch property terms migration to WRITE_NEW on production wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519212 (https://phabricator.wikimedia.org/T225053) [11:19:53] bmansurov: now it's reverted you have to revert the revert and ammend your fix [11:20:07] dcausse: ok [11:21:04] 10Operations, 10ops-eqiad: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T226599 (10jbond) Thanks @Volans perhaps the [https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Dell_Hardware_Raid_Information_Gathering runbook] should be updated. im not sure if th... [11:22:07] 10Operations, 10observability: Icinga custom checks should follow our HTTP User-Agent policy - https://phabricator.wikimedia.org/T226508 (10jbond) p:05Triage→03Normal [11:22:23] dcausse: I don't see the reverted patch. It's not in master. [11:22:30] bmansurov: me neither... 
sorry for the confusion [11:22:46] np, I'll just submit a fix on top of the previous patch then? [11:22:55] yes [11:22:59] ok [11:23:53] 10Operations, 10ops-eqiad: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T226599 (10Volans) I'll let them reply :) we have also an hpssacli version of kinda the same script fwiw. [11:24:44] (03PS1) 10Bmansurov: QuickSurveys: rename some surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519215 (https://phabricator.wikimedia.org/T226273) [11:24:49] dcausse: here ^ is the patch, can you deploy it? [11:24:53] sure [11:26:18] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519215 (https://phabricator.wikimedia.org/T226273) (owner: 10Bmansurov) [11:27:17] (03Merged) 10jenkins-bot: QuickSurveys: rename some surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519215 (https://phabricator.wikimedia.org/T226273) (owner: 10Bmansurov) [11:27:31] (03CR) 10jenkins-bot: QuickSurveys: rename some surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519215 (https://phabricator.wikimedia.org/T226273) (owner: 10Bmansurov) [11:28:15] bmansurov: it's live on mwdebug1002 [11:28:23] dcausse: ok, testing [11:33:43] dcausse: still testing, couple more minitues [11:33:48] np [11:35:19] (03CR) 10Filippo Giunchedi: "LGTM overall, I'm wondering what we would use to disable this feature, I'm thinking sth along the lines of thumbnail_expiry negative in th" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles) [11:37:59] dcausse: please deploy everywhere [11:38:05] deploying [11:40:08] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T226273: Enable reader demographics surveys (duration: 00m 55s) [11:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:14] T226273: Demographic Surveys Configurations - https://phabricator.wikimedia.org/T226273 
[11:40:17] bmansurov: ^ [11:40:41] dcausse: thanks! [11:40:45] yw! [11:41:04] RECOVERY - PHP opcache health on mwdebug2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:41:05] Urbanecm: thanks! [11:41:16] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:41:48] (03CR) 10CDanis: [C: 03+2] dbconfig: remove hostname from the instance schema [software/conftool] - 10https://gerrit.wikimedia.org/r/519155 (owner: 10Volans) [11:42:42] (03PS4) 10DCausse: [cirrus] remove unused wgCirrusSearchRequestEventSampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513982 [11:42:44] (03CR) 10CDanis: [C: 03+2] dbconfig: do not commit config if no diff [software/conftool] - 10https://gerrit.wikimedia.org/r/519214 (owner: 10Volans) [11:43:20] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513982 (owner: 10DCausse) [11:44:18] (03Merged) 10jenkins-bot: [cirrus] remove unused wgCirrusSearchRequestEventSampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513982 (owner: 10DCausse) [11:44:23] (03Merged) 10jenkins-bot: dbconfig: remove hostname from the instance schema [software/conftool] - 10https://gerrit.wikimedia.org/r/519155 (owner: 10Volans) [11:44:39] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10tramm) 05Open→03Stalled p:05Normal→03High Any news? Anyone able to help with this? 
[11:46:16] (03CR) 10jenkins-bot: [cirrus] remove unused wgCirrusSearchRequestEventSampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513982 (owner: 10DCausse) [11:47:33] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [cirrus] remove unused wgCirrusSearchRequestEventSampling (duration: 00m 54s) [11:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:20] (03PS5) 10DCausse: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512195 [11:50:01] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512195 (owner: 10DCausse) [11:50:56] (03Merged) 10jenkins-bot: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512195 (owner: 10DCausse) [11:51:10] (03CR) 10jenkins-bot: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512195 (owner: 10DCausse) [11:52:49] (03PS1) 10Bmansurov: Undeploy reader demographics surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519216 (https://phabricator.wikimedia.org/T226273) [11:53:21] (03CR) 10CDanis: [C: 03+2] dbconfig: unify MediaWiki objects into one [software/conftool] - 10https://gerrit.wikimedia.org/r/519156 (owner: 10Volans) [11:55:06] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group0 (duration: 00m 56s) [11:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:41] (03PS2) 10Lucas Werkmeister (WMDE): Set EntityUsageTable addUsage batch size to 100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518952 (https://phabricator.wikimedia.org/T225500) (owner: 10Ladsgroup) [11:55:53] (03Merged) 10jenkins-bot: dbconfig: unify MediaWiki objects into one [software/conftool] - 
10https://gerrit.wikimedia.org/r/519156 (owner: 10Volans) [11:56:22] (03CR) 10CDanis: [C: 03+2] dbconfig: honor datacenter scope in config restore [software/conftool] - 10https://gerrit.wikimedia.org/r/519213 (owner: 10Volans) [11:57:28] I need couple more minutes for EU swat [11:58:35] jouncebot: next [11:58:36] In 4 hour(s) and 1 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190626T1600) [11:58:40] plenty of time :) [11:58:51] (03Merged) 10jenkins-bot: dbconfig: honor datacenter scope in config restore [software/conftool] - 10https://gerrit.wikimedia.org/r/519213 (owner: 10Volans) [11:58:56] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10jcrespo) This is blocked on @CRoslof or someone else from legal. Last thing he said: > but we are still evaluating the implications of doing so [11:58:58] (03Merged) 10jenkins-bot: dbconfig: do not commit config if no diff [software/conftool] - 10https://gerrit.wikimedia.org/r/519214 (owner: 10Volans) [12:01:14] !log dcausse@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/CirrusSearch/includes/RequestLogger.php: T226568: Convert array params to string when logging requests (duration: 00m 56s) [12:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:24] T226568: PHP error from CirrusSearch/RequestLogger: "Array to string conversion" - https://phabricator.wikimedia.org/T226568 [12:02:25] !log EU swat done [12:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:43] dcausse: I need to deploy things :P [12:02:48] oops :) [12:02:56] I deploy it on my own, it's fine :D [12:02:57] !log Revert: EU swat done [12:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:04] Amir1: ok :) [12:03:49] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/518952 (https://phabricator.wikimedia.org/T225500) (owner: 10Ladsgroup) [12:04:46] (03Merged) 10jenkins-bot: Set EntityUsageTable addUsage batch size to 100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518952 (https://phabricator.wikimedia.org/T225500) (owner: 10Ladsgroup) [12:06:36] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:518952|Set EntityUsageTable addUsage batch size to 100 (T225500)]] (duration: 00m 56s) [12:06:39] (03CR) 10jenkins-bot: Set EntityUsageTable addUsage batch size to 100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518952 (https://phabricator.wikimedia.org/T225500) (owner: 10Ladsgroup) [12:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:43] T225500: Decrease EntityUsageTable addUsage batch size to 100 - https://phabricator.wikimedia.org/T225500 [12:06:44] 10Operations, 10ops-codfw, 10DBA: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (10jcrespo) a:03jcrespo [12:07:48] !log start of ladsgroup@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=wikidatawiki --batch-size=100 --sleep=3 [12:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:28] (03PS1) 10Jbond: icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) [12:11:57] (03PS1) 10Ema: cache: reimage cp5005 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519219 (https://phabricator.wikimedia.org/T226477) [12:12:36] (03CR) 10Giuseppe Lavagetto: "LGTM: https://puppet-compiler.wmflabs.org/compiler1002/17113/mw1300.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/519205 (owner: 10Giuseppe Lavagetto) [12:12:40] (03CR) 10Giuseppe Lavagetto: [C: 03+2] role::mediawiki::jobrunner: convert to 
profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/519205 (owner: 10Giuseppe Lavagetto) [12:20:30] (03CR) 10CDanis: "one comment that applies to all three files" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [12:23:00] (03PS3) 10Muehlenhoff: Re-enable TCP selective acknowledgements on hosts running a fixed kernel [puppet] - 10https://gerrit.wikimedia.org/r/519193 (https://phabricator.wikimedia.org/T225998) [12:25:25] !log depool cp5005 and reimage as upload_ats T226477 [12:25:30] 10Operations, 10Wikimedia-Mailing-lists: Request mailing list Chad - https://phabricator.wikimedia.org/T225240 (10jbond) 05Open→03Resolved a:03jbond Hello @Abdallahbigboy I have now created the [[https://lists.wikimedia.org/mailman/listinfo/wikimedia-tchad | Wikimedia-Tchad list]] and you should have r... [12:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:31] T226477: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 [12:25:43] (03PS2) 10Ema: cache: reimage cp5005 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519219 (https://phabricator.wikimedia.org/T226477) [12:26:21] (03CR) 10Ema: [C: 03+2] cache: reimage cp5005 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519219 (https://phabricator.wikimedia.org/T226477) (owner: 10Ema) [12:27:05] !log end of ladsgroup@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=wikidatawiki --batch-size=100 --sleep=3 (T225052) [12:27:14] !log EU SWAT is done for real [12:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:18] T225052: Run Property Terms Rebuild script - https://phabricator.wikimedia.org/T225052 [12:27:19] (03PS2) 10CDanis: dbctl: 'instance pool' now uses past percentage, instead of 100 [software/conftool] - 10https://gerrit.wikimedia.org/r/519129 
[12:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:28] (03PS2) 10Giuseppe Lavagetto: profile: introduce lvs_poool_nodes [puppet] - 10https://gerrit.wikimedia.org/r/519206 [12:27:36] (03CR) 10CDanis: dbctl: 'instance pool' now uses past percentage, instead of 100 (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/519129 (owner: 10CDanis) [12:29:07] (03PS3) 10CDanis: dbctl: 'instance pool' now uses past percentage, instead of 100 [software/conftool] - 10https://gerrit.wikimedia.org/r/519129 [12:30:12] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5005.eqsin.wmnet'] ` The log can be found in `... [12:31:06] (03PS4) 10CDanis: dbctl: 'instance pool' now uses past percentage, instead of 100 [software/conftool] - 10https://gerrit.wikimedia.org/r/519129 [12:34:19] 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10WMDE-Fisch) [12:34:21] (03PS4) 10Muehlenhoff: Re-enable TCP selective acknowledgements on hosts running a fixed kernel [puppet] - 10https://gerrit.wikimedia.org/r/519193 (https://phabricator.wikimedia.org/T225998) [12:36:17] 10Operations, 10SRE-Access-Requests: Request access to analytics cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223697 (10jbond) @alaa_wmde NDA permissions in genral for wmde staff is being discussed on a different ticket https://phabricator.wikimedia.org/T225004. however as @RStallman-legalteam... 
[12:36:23] (03CR) 10Muehlenhoff: [C: 03+2] Re-enable TCP selective acknowledgements on hosts running a fixed kernel [puppet] - 10https://gerrit.wikimedia.org/r/519193 (https://phabricator.wikimedia.org/T225998) (owner: 10Muehlenhoff) [12:37:03] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/17116/ this is a noop." [puppet] - 10https://gerrit.wikimedia.org/r/519206 (owner: 10Giuseppe Lavagetto) [12:37:14] (03PS3) 10Giuseppe Lavagetto: profile: introduce lvs_poool_nodes [puppet] - 10https://gerrit.wikimedia.org/r/519206 [12:37:35] 10Operations, 10LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (10jbond) just adding this link as its useful for seeing how the current permissions are and validating after action has been preformed https://to... [12:40:52] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5005.eqsin.wmnet'] ` The log can be found in `... [12:41:20] (03PS2) 10Giuseppe Lavagetto: mediawiki::php: add daemon restart cronjob [puppet] - 10https://gerrit.wikimedia.org/r/519207 (https://phabricator.wikimedia.org/T224857) [12:43:12] PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:44:04] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:44:30] (03CR) 10Volans: [C: 03+2] dbctl: 'instance pool' now uses past percentage, instead of 100 [software/conftool] - 10https://gerrit.wikimedia.org/r/519129 (owner: 10CDanis) [12:47:06] (03PS2) 10Jbond: icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) [12:47:09] (03Merged) 10jenkins-bot: dbctl: 'instance pool' now uses past percentage, instead of 100 [software/conftool] - 10https://gerrit.wikimedia.org/r/519129 (owner: 10CDanis) [12:48:13] RECOVERY - puppet last run on puppetmaster2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:48:39] (03Abandoned) 10CDanis: dbctl: de-generic-ify helper argument names [software/conftool] - 10https://gerrit.wikimedia.org/r/514632 (owner: 10CDanis) [12:49:03] PROBLEM - puppet last run on mw2219 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [12:51:55] (03PS1) 10CDanis: dbctl: de-generic-ify helper argument name for config subsection [software/conftool] - 10https://gerrit.wikimedia.org/r/519221 [12:54:07] RECOVERY - puppet last run on mw2219 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:55:30] (03CR) 10Jbond: "LGTM thanks for the link and explanation" [puppet] - 10https://gerrit.wikimedia.org/r/517875 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [12:55:38] (03CR) 10Jbond: [C: 03+1] Update check_timedatectl to latest version from DSA repository [puppet] - 10https://gerrit.wikimedia.org/r/517875 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [12:55:57] (03CR) 10Elukey: "need to fix some nits about profile::mariadb::misc::eventlogging::sanitization since pcc shows some diff!" 
[puppet] - 10https://gerrit.wikimedia.org/r/518954 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [13:03:38] (03PS3) 10Giuseppe Lavagetto: mediawiki::php: add daemon restart cronjob [puppet] - 10https://gerrit.wikimedia.org/r/519207 (https://phabricator.wikimedia.org/T224857) [13:04:01] (03PS3) 10Muehlenhoff: Update check_timedatectl to latest version from DSA repository [puppet] - 10https://gerrit.wikimedia.org/r/517875 (https://phabricator.wikimedia.org/T213527) [13:04:48] (03PS1) 10Marostegui: prometheus: Fix some database typos [puppet] - 10https://gerrit.wikimedia.org/r/519223 [13:05:33] (03CR) 10Muehlenhoff: [C: 03+2] Update check_timedatectl to latest version from DSA repository [puppet] - 10https://gerrit.wikimedia.org/r/517875 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [13:07:12] (03PS2) 10Marostegui: prometheus: Fix some database typos [puppet] - 10https://gerrit.wikimedia.org/r/519223 [13:09:39] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:09:41] anyone doing something at the moment? [13:09:45] Amir1: is your maintenance script done? [13:09:50] I’d like to run a different one (shouldn’t take long) [13:10:09] (*something MediaWiki-related, to clarify) [13:10:30] (03PS4) 10Giuseppe Lavagetto: mediawiki::php: add daemon restart cronjob [puppet] - 10https://gerrit.wikimedia.org/r/519207 (https://phabricator.wikimedia.org/T224857) [13:10:36] 10Operations, 10ops-codfw, 10decommission, 10observability: Decom graphite2002 - https://phabricator.wikimedia.org/T200210 (10MoritzMuehlenhoff) [13:11:06] 10Operations, 10ops-codfw, 10decommission, 10observability: Decom graphite2002 - https://phabricator.wikimedia.org/T200210 (10MoritzMuehlenhoff) a:05fgiunchedi→03RobH This is no longer needed for buster install tests and now good to decommision. 
[13:13:16] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1001/17120/mw1261.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/519207 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto) [13:13:29] (03PS3) 10Marostegui: prometheus: Fix some database typos [puppet] - 10https://gerrit.wikimedia.org/r/519223 [13:16:01] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Marostegui) @Ladsgroup I believe that last time it wasn't necessary, but I am not 100% sure [13:16:04] !log begin lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/WikibaseQualityConstraints/maintenance/ImportConstraintStatements.php wikidatawiki # T223372 [13:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:10] T223372: Constraint was removed from property but still displayed on Lexeme - https://phabricator.wikimedia.org/T223372 [13:17:03] !log end (success) lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/WikibaseQualityConstraints/maintenance/ImportConstraintStatements.php wikidatawiki # T223372 [13:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:10] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: master: allow connections to the API from any cloud VM [puppet] - 10https://gerrit.wikimedia.org/r/519225 (https://phabricator.wikimedia.org/T215531) [13:18:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: master: allow connections to the API from any cloud VM [puppet] - 10https://gerrit.wikimedia.org/r/519225 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [13:19:36] separate begin/end log might’ve been overkill on that, didn’t take very long ^^ [13:21:36] (03CR) 10Ottomata: "thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/519199 (https://phabricator.wikimedia.org/T226517) (owner: 10Elukey) [13:21:47] 10Operations, 10ops-eqiad: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T226599 (10elukey) Is there a way to stop this check for some hosts? In this case, this is the hadoop testing cluster, all OOW hardware.. [13:22:46] (03PS6) 10Elukey: Replace profile::analytics::systemd_timer with kerberos::systemd_timer [puppet] - 10https://gerrit.wikimedia.org/r/518954 (https://phabricator.wikimedia.org/T212259) [13:24:38] hm, though it looks like this caused a visible traffic spike on the s8 master https://w.wiki/5KP [13:24:55] I’ll create a task to make that script a bit more friendly [13:25:19] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:25:22] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:09] (03CR) 10Elukey: [C: 03+2] Replace profile::analytics::systemd_timer with kerberos::systemd_timer [puppet] - 10https://gerrit.wikimedia.org/r/518954 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [13:28:28] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5005.eqsin.wmnet'] ` and were **ALL** successful. 
[13:29:59] (03PS7) 10Elukey: camus: add support for kerberos [puppet] - 10https://gerrit.wikimedia.org/r/518958 (https://phabricator.wikimedia.org/T212259) [13:31:43] (03CR) 10Elukey: [C: 03+2] camus: add support for kerberos [puppet] - 10https://gerrit.wikimedia.org/r/518958 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [13:31:46] !log rebooting graphite2003 for kernel security update [13:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:41] !log pool cp5005 w/ ATS backend T226477 [13:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:46] T226477: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 [13:33:22] (03CR) 10Jcrespo: "Let's wait for my codfw review. See also:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519223 (owner: 10Marostegui) [13:34:14] (03CR) 10Marostegui: prometheus: Fix some database typos (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519223 (owner: 10Marostegui) [13:34:47] (03PS4) 10Marostegui: prometheus: Fix some database typos [puppet] - 10https://gerrit.wikimedia.org/r/519223 [13:37:11] (03PS1) 10Jbond: icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519227 (https://phabricator.wikimedia.org/T226508) [13:39:17] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:39:49] (03PS1) 10Ema: cache: reimage cp5006 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519228 (https://phabricator.wikimedia.org/T226477) [13:40:32] (03CR) 10Effie Mouzeli: [C: 03+1] mediawiki::php: add daemon restart cronjob (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519207 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto) [13:40:38] task created: https://phabricator.wikimedia.org/T226635 [13:40:58] (03Abandoned) 10Effie Mouzeli: WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) (owner: 10Effie Mouzeli) [13:41:42] (03CR) 10Jbond: "./check_graphite.py --url https://graphite-labs.wikimedia.org -T 10 check_threshold 'transformNull(sumSeries(logstash.rate.mediawiki.fatal" [puppet] - 10https://gerrit.wikimedia.org/r/519227 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [13:42:58] (03CR) 10Jcrespo: "es2016:9104 and es2015:91016 are inverted on zarcillo. I think the prometheus one is the one that is incorrect." [puppet] - 10https://gerrit.wikimedia.org/r/519223 (owner: 10Marostegui) [13:43:42] (03CR) 10Jcrespo: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/519223 (owner: 10Marostegui) [13:46:57] (03PS1) 10Volans: dbconfig: add special group alias 'all' [software/conftool] - 10https://gerrit.wikimedia.org/r/519229 [13:47:57] (03PS5) 10Marostegui: prometheus: Fix some database typos [puppet] - 10https://gerrit.wikimedia.org/r/519223 [13:48:00] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (10Esc3300) I'm trying to figure out what the volume of queries on WQS may be: If I get https://grafana.wikimedia.org/d/000000489/wiki... 
[13:48:25] (03PS2) 10Herron: kafka2003 move to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/519084 (https://phabricator.wikimedia.org/T225005) [13:48:46] 10Operations, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Dumps-Generation: Image tarball dumps on your.org are not being generated - https://phabricator.wikimedia.org/T53001 (10ArielGlenn) I don't know if we'll bring back the tarballs but I do have a stealth project to get the rsyncable directo... [13:48:53] !log push RPKI classification test to cr4-ulsfo - T220669 [13:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:58] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669 [13:49:22] 10Operations, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Dumps-Generation: Image tarball dumps on your.org are not being generated - https://phabricator.wikimedia.org/T53001 (10ArielGlenn) 05Stalled→03Open [13:51:27] (03CR) 10CDanis: [C: 03+2] dbconfig: add special group alias 'all' [software/conftool] - 10https://gerrit.wikimedia.org/r/519229 (owner: 10Volans) [13:51:45] (03CR) 10Herron: [C: 03+2] kafka2003 move to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/519084 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [13:53:04] (03CR) 10Muehlenhoff: [C: 03+1] cache: reimage cp5006 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519228 (https://phabricator.wikimedia.org/T226477) (owner: 10Ema) [13:53:34] (03PS2) 10Jbond: icinga user agent: add custom user agent to icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/519227 (https://phabricator.wikimedia.org/T226508) [13:54:13] (03Merged) 10jenkins-bot: dbconfig: add special group alias 'all' [software/conftool] - 10https://gerrit.wikimedia.org/r/519229 (owner: 10Volans) [13:54:50] (03CR) 10Volans: [C: 03+2] dbctl: de-generic-ify helper argument name for config subsection [software/conftool] - 10https://gerrit.wikimedia.org/r/519221 (owner: 
10CDanis) [13:55:05] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:55:07] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:19] !log rebooting puppetboard* to pick up MDS-enabled qemu and new kernel [13:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:23] (03Merged) 10jenkins-bot: dbctl: de-generic-ify helper argument name for config subsection [software/conftool] - 10https://gerrit.wikimedia.org/r/519221 (owner: 10CDanis) [13:57:39] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ema) [13:57:46] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ema) p:05Triage→03Normal [13:57:54] (03CR) 10Jcrespo: [C: 03+1] "> Patch Set 5: Verified+2" [puppet] - 10https://gerrit.wikimedia.org/r/519223 (owner: 10Marostegui) [13:58:03] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (10ema) [13:58:10] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (10ema) p:05Triage→03Normal [14:01:41] !log rebooting graphite1004 for kernel security update [14:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:39] (03CR) 10BBlack: [C: 03+1] cache: add cp3043 back to the text cluster [puppet] - 10https://gerrit.wikimedia.org/r/519189 (https://phabricator.wikimedia.org/T226375) (owner: 10Ema) [14:06:41] (03PS3) 10Jcrespo: prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/519203 
(https://phabricator.wikimedia.org/T143896) [14:07:30] (03CR) 10jerkins-bot: [V: 04-1] prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [14:08:53] (03PS6) 10Marostegui: prometheus: Fix some database typos [puppet] - 10https://gerrit.wikimedia.org/r/519223 [14:09:35] (03CR) 10Marostegui: "> > Patch Set 5: Verified+2" [puppet] - 10https://gerrit.wikimedia.org/r/519223 (owner: 10Marostegui) [14:09:48] (03PS3) 10Cwhite: branched from tags/v2.0.0 and added debian directory [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/519068 [14:09:48] jynus: ^ [14:09:55] (03PS4) 10Jcrespo: prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) [14:10:47] (03CR) 10Jcrespo: [C: 03+1] prometheus: Fix some database typos [puppet] - 10https://gerrit.wikimedia.org/r/519223 (owner: 10Marostegui) [14:11:07] (03PS7) 10Marostegui: prometheus: Fix some database typos [puppet] - 10https://gerrit.wikimedia.org/r/519223 [14:11:44] (03PS5) 10Giuseppe Lavagetto: mediawiki::php: add daemon restart cronjob [puppet] - 10https://gerrit.wikimedia.org/r/519207 (https://phabricator.wikimedia.org/T224857) [14:11:51] (03CR) 10Marostegui: [C: 03+2] prometheus: Fix some database typos [puppet] - 10https://gerrit.wikimedia.org/r/519223 (owner: 10Marostegui) [14:11:54] moritzm, I'm having trouble accessing graphite via grafana. Could it be related to your reboot? Should it be back online by now? 
[14:12:42] !log depool cp3043 and convert it from upload to text [14:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:12] (03CR) 10Cwhite: [C: 03+2] branched from tags/v2.0.0 and added debian directory (034 comments) [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/519068 (owner: 10Cwhite) [14:13:18] (03PS2) 10Ema: cache: add cp3043 back to the text cluster [puppet] - 10https://gerrit.wikimedia.org/r/519189 (https://phabricator.wikimedia.org/T226375) [14:13:21] halfak: yeah, it's related to the reboot, the graphite hosts are not redundant unfortunately [14:13:37] (03PS27) 10BBlack: Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [14:13:43] Gotcha. Should I expect it to come back online in a minute or two? [14:13:46] (03CR) 10Ema: [C: 03+2] cache: add cp3043 back to the text cluster [puppet] - 10https://gerrit.wikimedia.org/r/519189 (https://phabricator.wikimedia.org/T226375) (owner: 10Ema) [14:13:50] ack, the reboots also unveiled an issue, should be recovering soon [14:16:24] !log beginning replacement of kafka2002 with kafka-main2002 T225005 [14:16:25] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [14:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:30] T225005: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 [14:16:52] halfak: should be back [14:16:56] working for me. Thanks moritzm :) [14:19:00] (03PS7) 10Giuseppe Lavagetto: mediawiki::php: add daemon restart cronjob [puppet] - 10https://gerrit.wikimedia.org/r/519207 (https://phabricator.wikimedia.org/T224857) [14:19:03] <_joe_> another rebase war? [14:19:06] <_joe_> .... 
[14:19:09] 04Critical Alert for device fasw-c-codfw.mgmt.codfw.wmnet - Device took too long to poll [14:19:16] 04Critical Alert for device msw1-codfw.mgmt.codfw.wmnet - Device took too long to poll [14:19:17] <_joe_> ? [14:19:23] 04Critical Alert for device asw-c-codfw.mgmt.codfw.wmnet - Device took too long to poll [14:19:24] (03PS2) 10Jbond: icinga user agent: add custom user agent to icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/519234 (https://phabricator.wikimedia.org/T226508) [14:19:31] 04Critical Alert for device asw-a-codfw.mgmt.codfw.wmnet - Device took too long to poll [14:19:35] war… what is it good for [14:19:36] yikes [14:19:38] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mediawiki::php: add daemon restart cronjob [puppet] - 10https://gerrit.wikimedia.org/r/519207 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto) [14:19:40] could be related to the puppetboard1001 reboot, not sure? [14:19:50] 10Operations, 10Traffic, 10Patch-For-Review: Investigate esams text varnish backend fetch failures - https://phabricator.wikimedia.org/T226375 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3043.esams.wmnet'] ` The log can be found in `/var/log/wm... [14:19:54] but probbably not [14:20:06] <_joe_> I don't think so, no [14:20:09] 04Critical Alert for device fasw-c-codfw.mgmt.codfw.wmnet - Device took too long to poll [14:20:12] (03CR) 10Jcrespo: [C: 03+1] "+1, although with some questions." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/519159 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [14:20:12] <_joe_> XioNoX: ^^ seen those laerts? 
[14:20:16] 04Critical Alert for device msw1-codfw.mgmt.codfw.wmnet - Device took too long to poll [14:20:24] 04Critical Alert for device asw-c-codfw.mgmt.codfw.wmnet - Device took too long to poll [14:20:31] 04Critical Alert for device asw-a-codfw.mgmt.codfw.wmnet - Device took too long to poll [14:20:32] (03PS3) 10Jbond: icinga user agent: add custom user agent to icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/519234 (https://phabricator.wikimedia.org/T226508) [14:20:38] 04Critical Alert for device asw-b-codfw.mgmt.codfw.wmnet - Device took too long to poll [14:20:43] no urgent, but looking [14:20:45] 04Critical Alert for device asw-d-codfw.mgmt.codfw.wmnet - Device took too long to poll [14:20:52] 04Critical Alert for device cr1-eqsin.wikimedia.org - Device took too long to poll [14:20:52] 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10zeljkofilipin) >>! In T226091#5268635, @Aklapper wrote: > **Edit:** I was informed that https://office.wikimedia.org/wiki/Technology/Onboarding exists. Cool! I didn't know about that page. [14:20:54] <_joe_> that's all of codfw [14:20:54] (03CR) 10Jcrespo: [C: 03+1] "> Patch Set 1: Code-Review+1" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/519159 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [14:20:55] could be the sign of a different issue [14:21:05] yeah, and eqsin goes through codfw [14:21:08] <_joe_> yeah not saying those are the issue [14:21:10] 04Critical Alert for device asw-c-codfw.mgmt.codfw.wmnet - Device took too long to poll [14:21:17] 04Critical Alert for device asw-a-codfw.mgmt.codfw.wmnet - Device took too long to poll [14:21:24] 04Critical Alert for device asw-b-codfw.mgmt.codfw.wmnet - Device took too long to poll [14:21:25] <_joe_> they seem like a symptom [14:21:31] 04Critical Alert for device asw-d-codfw.mgmt.codfw.wmnet - Device took too long to poll [14:21:36] did we lose a link or something? 
[14:21:38] 04Critical Alert for device cr1-eqsin.wikimedia.org - Device took too long to poll [14:21:44] what's going on? [14:22:09] <_joe_> XioNoX: you're looking into it? [14:22:09] 04̶C̶r̶i̶t̶i̶c̶a̶l Device fasw-c-codfw.mgmt.codfw.wmnet recovered from Device took too long to poll [14:22:13] yep [14:22:16] 04̶C̶r̶i̶t̶i̶c̶a̶l Device msw1-codfw.mgmt.codfw.wmnet recovered from Device took too long to poll [14:22:21] <_joe_> ok thanks, let us know if you need help [14:22:23] 04Critical Alert for device asw-c-codfw.mgmt.codfw.wmnet - Device took too long to poll [14:22:26] (03CR) 10BBlack: [C: 04-1] "Nits in comments about type naming..." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [14:22:31] 04Critical Alert for device asw-a-codfw.mgmt.codfw.wmnet - Device took too long to poll [14:22:33] <_joe_> I also see ipsec alerts [14:22:38] 04Critical Alert for device asw-b-codfw.mgmt.codfw.wmnet - Device took too long to poll [14:22:39] _joe_: unrelated [14:22:42] <_joe_> which are related to reboots though [14:22:45] 04Critical Alert for device asw-d-codfw.mgmt.codfw.wmnet - Device took too long to poll [14:22:52] 04Critical Alert for device cr1-eqsin.wikimedia.org - Device took too long to poll [14:22:58] (and ack'ed now) [14:23:09] <_joe_> ema: <3 [14:23:09] 04Critical Alert for device cr1-eqsin.wikimedia.org - Device took too long to poll [14:23:17] there is still the Zayo link down between cr2-eqiad and cr2-codfw, but that's been true for a long time now [14:23:44] I'm muting the alert [14:23:58] I would've said codfw mgmt issue, but then how are eqsin routers affected by a codfw-mgmt-only problem? 
[14:24:12] Yeah, that is weird [14:24:32] Nothing relevant on the ops calendar for any vendor's maintenance, btw [14:24:37] one eqiad/codfw link has been down for 17h [14:25:08] we've had some odd real traffic stats past few minutes as well [14:25:21] https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&from=now-1h&to=now [14:25:29] smokeping doesn't seem to show any latency change [14:25:38] notice total dropout of request stats there circa 14:02 -> 14:05 [14:25:49] but that is hopefully just missing stats, it seems to artificial? [14:25:51] *too [14:25:53] I think that was graphite being rebooted, bblack [14:25:54] bblack: that could be graphite reboot [14:26:24] ah right, dumb browser history autocompletes the old one instead of the prometheus one :) [14:26:26] prometheus looks fine I think https://grafana.wikimedia.org/d/000000464/prometheus-varnish-aggregate-client-status-code?orgId=1&from=now-1h&to=now [14:26:28] https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&from=now-1h&to=now [14:26:31] vs [14:26:37] https://grafana.wikimedia.org/d/000000464/prometheus-varnish-aggregate-client-status-code?orgId=1 [14:27:03] (03CR) 10Bstorm: "> Patch Set 1:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/519159 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [14:28:06] I can ping ping asw-b-codfw.mgmt.codfw.wmnet from both eqiad and codfw bastions [14:28:13] yeah, smokeping too [14:28:34] https://librenms.wikimedia.org/device/device=95/ show a drop and a spike, but it's only monitoring artefact [14:28:36] maybe librenms issue, or snmp-specific issue? [14:28:48] yeah, that would be my guess so far [14:28:59] now that the links look fine [14:29:19] PROBLEM - Check systemd state on mw1261 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:30:28] weird that it's only codfw/eqsin, as it seems to match a path issue (eg. 
ulsfo not impacted because it goes through a different path) [14:31:34] could be something going on in the codfw routers' filters that's blocking SNMP traffic accidentally? [14:31:36] could anything have happened to change the effective MTU on those paths? [14:31:42] (for odfw mgmt + forwarding over the link to eqsin?) [14:35:27] (03PS3) 10Jbond: icinga user agent: add custom user agent to icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/519227 (https://phabricator.wikimedia.org/T226508) [14:36:00] what's the perceived impact so far? [14:36:42] XioNoX: ^ [14:37:25] paravoid: none except monitoring glitch on librenms [14:37:26] it doesn't seem like there's any impact outside of librenms being annoyed, yet [14:37:28] I haven't found anything, smokeping and traffic stats looked fine at a glance [14:37:49] is there a theory on why this started alerting just now? [14:37:56] I don't have one [14:38:22] although obviously if there was any network device config change anywhere shortly before, that would be highly suspect. [14:38:35] (03PS1) 10Techguru: Increase capacity of CloudVPS [puppet] - 10https://gerrit.wikimedia.org/r/519237 (https://phabricator.wikimedia.org/T226632) [14:39:20] (03PS2) 10Bstorm: haproxy: make monitoring code optional [puppet] - 10https://gerrit.wikimedia.org/r/519159 (https://phabricator.wikimedia.org/T215531) [14:39:27] XioNoX: ^ [14:39:48] also if you're looking into this, please communicate what you're looking into specifically [14:40:13] (that is to everyone) [14:40:34] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:42:56] 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10awight) Confirmed that the old version of wikidiff2 will fall back to not computing moved lines with more than 30 moves: htt... [14:43:06] puppetdb critical by way of netbox, which also runs on netmon1002? [14:43:09] (03PS3) 10Bstorm: haproxy: make monitoring code optional [puppet] - 10https://gerrit.wikimedia.org/r/519159 (https://phabricator.wikimedia.org/T215531) [14:43:12] herron: unrelated [14:43:24] herron: it's the puppetdb report, data error [14:44:02] not so far [14:44:02] https://librenms.wikimedia.org/device/device=95/tab=graphs/group=poller/ [14:44:23] (03CR) 10Alexandros Kosiaris: "> The patch mentions buster, but doesn't enable Buster as the OS? If we want to mix stretch and buster we'll need to deploy T210289 first." [puppet] - 10https://gerrit.wikimedia.org/r/519075 (https://phabricator.wikimedia.org/T224603) (owner: 10Alexandros Kosiaris) [14:44:46] XioNoX: does that mean no theory so far or...? [14:45:01] what caused the latency drop ~17h ago? [14:45:11] (in the graphs linked just above) [14:45:17] 14:24 < XioNoX> one eqiad/codfw link has been down for 17h [14:45:20] paravoid: no theory so far, seems like for some reason snmp failed 1 pull from the devices [14:45:25] oh ok [14:45:34] (03CR) 10Alexandros Kosiaris: [C: 03+1] ganeti: Setup buster and a software RAID5 recipe [puppet] - 10https://gerrit.wikimedia.org/r/519075 (https://phabricator.wikimedia.org/T224603) (owner: 10Alexandros Kosiaris) [14:45:41] maybe the link down is related to the latency drop? [14:45:43] does that mean we don't prefer our lowest-latency link? 
:) [14:45:44] bblack: the main zayo link failed, which is different from the backup telia link that failed yesterday I think [14:45:59] (03CR) 10Bstorm: [C: 03+2] haproxy: make monitoring code optional [puppet] - 10https://gerrit.wikimedia.org/r/519159 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [14:46:56] current status for the Zayo link is "Current Update: OSP construction crews continue to dig out the duct for repairs. " no TEA [14:46:58] ETA [14:47:20] (03CR) 10Techguru: "Inviting you to review change to the CloudVPS scheduled pool" [puppet] - 10https://gerrit.wikimedia.org/r/519237 (https://phabricator.wikimedia.org/T226632) (owner: 10Techguru) [14:47:47] (03CR) 10Bstorm: [C: 03+2] haproxy: make monitoring code optional [puppet] - 10https://gerrit.wikimedia.org/r/519159 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [14:47:59] (03PS1) 10Giuseppe Lavagetto: php-restarts: fix typos [puppet] - 10https://gerrit.wikimedia.org/r/519239 [14:48:48] (03PS3) 10Jbond: icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) [14:48:59] XioNoX: paravoid: I don't know what this means but there's a BGP change on cr1-eqsin that correlates with the time of the blips https://librenms.wikimedia.org/graphs/type=bgp_updates/id=4160/to=1561560300/from=1561538700/afi=ipv4/safi=unicast/ [14:49:14] (03CR) 10Giuseppe Lavagetto: [C: 03+2] php-restarts: fix typos [puppet] - 10https://gerrit.wikimedia.org/r/519239 (owner: 10Giuseppe Lavagetto) [14:49:16] (03CR) 10Bstorm: "> Patch Set 1:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/519159 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [14:49:22] (03PS2) 10Giuseppe Lavagetto: php-restarts: fix typos [puppet] - 10https://gerrit.wikimedia.org/r/519239 [14:49:24] (03CR) 10jerkins-bot: [V: 04-1] icinga user agent: add custom user agent to icing checks [puppet] - 
10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [14:49:34] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] php-restarts: fix typos [puppet] - 10https://gerrit.wikimedia.org/r/519239 (owner: 10Giuseppe Lavagetto) [14:50:18] (03CR) 10Bstorm: "Also tested with the compiler to make sure it wouldn't do anything crazy: https://puppet-compiler.wmflabs.org/compiler1002/17123/dbproxy10" [puppet] - 10https://gerrit.wikimedia.org/r/519159 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [14:50:20] oh also, I didn't realize until just now that the SNMP issues are gone (because icinga was silenced for the recovery I guess) [14:50:44] cdanis: that is a good find [14:50:56] (03PS2) 10Bstorm: toolforge: correct a bunch of the apilb profile [puppet] - 10https://gerrit.wikimedia.org/r/519160 (https://phabricator.wikimedia.org/T215531) [14:51:09] cdanis: which I cannot explain [14:51:22] so the SNMP issue was brief, while I was thinking it mysteriously stayed broken heh [14:51:30] no peers flapped, this was probably a storm of route updates or something [14:52:08] can we figure out whether it was external to us or not from that? [14:52:39] looking at other update graphs there were possibly-increased-from-a-noisy-baseline BGP updates around 14:20-14:30 [14:53:03] RECOVERY - Check systemd state on mw1261 is OK: OK - running: The system is fully operational [14:53:41] I wonder how much of that is... polling errors [14:53:53] because the error was SNMP poll errors right [14:54:26] 10Operations, 10Traffic, 10Patch-For-Review: Investigate esams text varnish backend fetch failures - https://phabricator.wikimedia.org/T226375 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3043.esams.wmnet'] ` and were **ALL** successful. 
[14:54:30] yes I think so [14:54:34] bblack: yeah, I disabled the alert to not make it spam, but yeah it was a 1 time (1 pool) event [14:54:40] (03PS4) 10Jbond: icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519227 (https://phabricator.wikimedia.org/T226508) [14:55:36] ok [14:56:39] paravoid: I think those BGP update graphs might be 'real' and not monitoring/scraping artifacts; other graphs show 0 values for the time of bad scrapes but these don't [14:57:57] (03PS2) 10Giuseppe Lavagetto: mediawiki: run the cron for php restarts everywhere [puppet] - 10https://gerrit.wikimedia.org/r/519208 (https://phabricator.wikimedia.org/T224857) [14:58:05] 10Operations, 10Analytics, 10Fundraising-Backlog, 10LDAP-Access-Requests, and 2 others: Turnilo access for Camille de Nes (Advancement) - https://phabricator.wikimedia.org/T226614 (10DStrine) [14:59:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: run the cron for php restarts everywhere [puppet] - 10https://gerrit.wikimedia.org/r/519208 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto) [14:59:46] (03CR) 10Andrew Bogott: [C: 03+2] "Thanks! This was probably just an oversight since the inline comments say that it's scheduled." [puppet] - 10https://gerrit.wikimedia.org/r/519237 (https://phabricator.wikimedia.org/T226632) (owner: 10Techguru) [14:59:54] (03PS2) 10Andrew Bogott: Increase capacity of CloudVPS [puppet] - 10https://gerrit.wikimedia.org/r/519237 (https://phabricator.wikimedia.org/T226632) (owner: 10Techguru) [14:59:57] 10Operations, 10Analytics, 10Fundraising-Backlog, 10LDAP-Access-Requests, and 2 others: Turnilo access for Camille de Nes (Advancement) - https://phabricator.wikimedia.org/T226614 (10DStrine) I added SRE access and analytics tags as I think one of them is appropriate for this task. Please let us know what... [15:00:23] if storm of route updates, nothing big enough to trigger our max prefix limit. 
But no way to know their content [15:01:04] !log pool cp3043 as cache_text [15:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:16] ok [15:02:22] also we don't allow peers to send us our own prefixes (or private ones) so it should not impact anything internally [15:02:45] (03PS1) 10Jbond: icinga user agent: add custom user agent to icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/519240 (https://phabricator.wikimedia.org/T226508) [15:03:11] XioNoX et al: I've got one cp5 host reimage left for today. OK to proceed or should I postpone? [15:03:30] ema: yup, good [15:03:50] (03CR) 10Jbond: "sorry about the PS history on this, change got mixed up with https://gerrit.wikimedia.org/r/c/operations/puppet/+/519240. please just pay" [puppet] - 10https://gerrit.wikimedia.org/r/519227 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [15:04:45] XioNoX: ack thanks [15:04:47] !log depool cp5006 and reimage as upload_ats T226477 [15:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:54] T226477: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 [15:05:22] (03PS3) 10Herron: kafka-main: replace kafka2002 hardware with kafka-main2002 [puppet] - 10https://gerrit.wikimedia.org/r/519130 (https://phabricator.wikimedia.org/T225005) [15:06:06] (03PS2) 10Ema: cache: reimage cp5006 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519228 (https://phabricator.wikimedia.org/T226477) [15:06:21] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:06:48] (03CR) 10Ema: [C: 03+2] cache: reimage cp5006 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519228 (https://phabricator.wikimedia.org/T226477) (owner: 10Ema) [15:07:31] (03PS4) 10Jbond: icinga user agent: add custom user agent to icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/519234 (https://phabricator.wikimedia.org/T226508) [15:07:55] (03CR) 10Herron: [C: 03+2] kafka-main: replace kafka2002 hardware with kafka-main2002 [puppet] - 10https://gerrit.wikimedia.org/r/519130 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [15:08:03] (03PS4) 10Herron: kafka-main: replace kafka2002 hardware with kafka-main2002 [puppet] - 10https://gerrit.wikimedia.org/r/519130 (https://phabricator.wikimedia.org/T225005) [15:09:39] (03PS2) 10Mathew.onipe: icinga: fix zero division error for mjolnir bulk update alert [puppet] - 10https://gerrit.wikimedia.org/r/519201 (https://phabricator.wikimedia.org/T225904) [15:10:01] (03PS3) 10Bstorm: toolforge: correct a bunch of the apilb profile [puppet] - 10https://gerrit.wikimedia.org/r/519160 (https://phabricator.wikimedia.org/T215531) [15:10:07] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:10:16] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5006.eqsin.wmnet'] ` The log can be found in `... [15:11:26] (03CR) 10Bstorm: "This won't work without this patch, so I'm merging it. 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/519160 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:11:27] (03CR) 10Bstorm: [C: 03+2] toolforge: correct a bunch of the apilb profile [puppet] - 10https://gerrit.wikimedia.org/r/519160 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:11:33] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:12:03] (03CR) 10Nuria: [C: 03+1] role::druid::analytics|public::worker: set stricter query timeouts [puppet] - 10https://gerrit.wikimedia.org/r/519181 (https://phabricator.wikimedia.org/T226035) (owner: 10Elukey) [15:12:59] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:12:59] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational [15:14:17] (03CR) 10CDanis: [C: 03+1] icinga user agent: add custom user agent to icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/519240 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [15:15:18] (03CR) 10CDanis: [C: 03+1] icinga user agent: add custom user agent to icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/519234 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [15:16:39] (03CR) 10Bstorm: "Hrm, this one'll need some changes to work now, thanks to my meddling." 
[puppet] - 10https://gerrit.wikimedia.org/r/513759 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE)) [15:17:54] (03CR) 10CDanis: icinga user agent: add custom user agent to icing checks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/519227 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [15:20:25] (03PS1) 10Gergő Tisza: Enable sending JS errors to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519243 (https://phabricator.wikimedia.org/T217142) [15:21:18] (03PS4) 10Jbond: icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) [15:21:21] (03CR) 10jerkins-bot: [V: 04-1] icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [15:23:23] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Traffic, 10Performance-Team (Radar): Increase EventLogging limit from 2K to 5K - https://phabricator.wikimedia.org/T208282 (10Ottomata) 05Open→03Declined Modern Event Platform's EventGate will support larger events in POST bodies. 
[15:23:25] (03PS5) 10Jbond: icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519227 (https://phabricator.wikimedia.org/T226508) [15:24:42] (03CR) 10Jbond: icinga user agent: add custom user agent to icing checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519227 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [15:26:31] (03PS2) 10Kosta Harlan: GrowthExperiments: Enable mobile homepage for cswiki and kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518779 (https://phabricator.wikimedia.org/T225676) [15:27:11] (03PS5) 10Jbond: icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) [15:30:52] 10Operations, 10Analytics, 10Fundraising-Backlog, 10LDAP-Access-Requests, 10Wikimedia-Fundraising: Turnilo access for Camille de Nes (Advancement) - https://phabricator.wikimedia.org/T226614 (10RobH) [15:31:12] (03PS1) 10CRusnov: decommission: Add Netbox state change [cookbooks] - 10https://gerrit.wikimedia.org/r/519244 [15:32:18] 10Operations, 10Analytics, 10Fundraising-Backlog, 10LDAP-Access-Requests, 10Wikimedia-Fundraising: Turnilo access for Camille de Nes (Advancement) - https://phabricator.wikimedia.org/T226614 (10RobH) LDAP requests are in #ldap-access-requests not #sre-access-requests (which is for shell), so just cleanin... [15:33:28] 10Operations, 10Analytics, 10Fundraising-Backlog, 10LDAP-Access-Requests, 10Wikimedia-Fundraising: Turnilo access for Camille de Nes (Advancement) - https://phabricator.wikimedia.org/T226614 (10Nuria) The only thing needed for turnilo access is to be in nda group (if Camile is an employee she can be adde... 
[15:36:14] (03PS4) 10Cwhite: branched from tags/v2.0.0 and added debian directory [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/519068 [15:40:51] 10Operations, 10netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10ayounsi) > I'd say to deploy the two policies to all routers, even if unused (because e.g. they're not peering routers) - after initial testing that is. Yup, that's the plan, to have all routers similar. > Maybe deploy it on... [15:40:54] (03PS1) 10RobH: update dhcp file for ganeti400[123] [puppet] - 10https://gerrit.wikimedia.org/r/519245 (https://phabricator.wikimedia.org/T226444) [15:41:43] (03CR) 10Mathew.onipe: icinga: fix zero division error for mjolnir bulk update alert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519201 (https://phabricator.wikimedia.org/T225904) (owner: 10Mathew.onipe) [15:41:53] 10Operations, 10netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10JobSnijders) Try the following: ` 'members 0x4300:0.0.0.0:2' ` This is a documentation bug on juniper's website. It has been reported to them already. [15:42:17] (03CR) 10RobH: [C: 03+2] update dhcp file for ganeti400[123] [puppet] - 10https://gerrit.wikimedia.org/r/519245 (https://phabricator.wikimedia.org/T226444) (owner: 10RobH) [15:42:25] (03PS2) 10RobH: update dhcp file for ganeti400[123] [puppet] - 10https://gerrit.wikimedia.org/r/519245 (https://phabricator.wikimedia.org/T226444) [15:42:31] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:46:55] !log ppchelko@deploy1001 Started deploy [restbase/deploy@995bc9d]: Use new projects and new config layout T220855, canaries only [15:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:00] T220855: Split the RESTBase execution paths - https://phabricator.wikimedia.org/T220855 [15:50:26] !log ppchelko@deploy1001 deploy aborted: Use new projects and new config layout T220855, canaries only (duration: 03m 31s) [15:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:08] !log ppchelko@deploy1001 Started deploy [restbase/deploy@574a678]: Revert [15:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:47] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:53:04] (03CR) 10Isaac Johnson: "See comment about disabling quicksurveys for wikis without other surveys." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519216 (https://phabricator.wikimedia.org/T226273) (owner: 10Bmansurov) [15:54:16] ^ cpt deploye [15:54:45] deploy* [15:54:49] gosh... [15:54:55] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@574a678]: Revert (duration: 03m 47s) [15:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:45] PROBLEM - Check systemd state on puppetboard2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:56:01] PROBLEM - DPKG on puppetboard2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:56:04] !log Depooling restbase1007 [15:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:19] PROBLEM - puppet last run on puppetboard2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[uwsgi] [15:58:05] 10Operations, 10netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10MelchiorAelmans) Thanks @JobSnijders for bringing this to my attention. I've raised this with the documentation team and also added a comment to the SR. Should be fixed soon. Indeed this should be configured as: set policy-... [16:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190626T1600). [16:00:04] Urbanecm and kostajh: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:13] Hi kostajh, I can SWAT today! [16:00:15] 10Operations, 10Analytics, 10Fundraising-Backlog, 10LDAP-Access-Requests, 10Wikimedia-Fundraising: Turnilo access for Camille de Nes (Advancement) - https://phabricator.wikimedia.org/T226614 (10jrobell) thank you for clarifying that @Nuria . I can confirm that Camille is a staff number with a req number.... [16:00:18] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Ladsgroup) >>! In T226358#5286014, @Marostegui wrote: > @Ladsgroup I believe that last time it wasn't necessary, but I am not 100% s... 
[16:00:20] sounds good Urbanecm [16:00:29] milimetric: got time for a quick review of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/519243 so it can be SWATted? [16:01:11] or ottomata [16:01:13] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518779 (https://phabricator.wikimedia.org/T225676) (owner: 10Kosta Harlan) [16:01:27] (03CR) 10Bmansurov: Undeploy reader demographics surveys (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519216 (https://phabricator.wikimedia.org/T226273) (owner: 10Bmansurov) [16:02:00] (03Merged) 10jenkins-bot: GrowthExperiments: Enable mobile homepage for cswiki and kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518779 (https://phabricator.wikimedia.org/T225676) (owner: 10Kosta Harlan) [16:02:11] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Marostegui) >>! In T226358#5286495, @Ladsgroup wrote: >>>! In T226358#5286014, @Marostegui wrote: >> @Ladsgroup I believe that last... [16:02:17] (03CR) 10jenkins-bot: GrowthExperiments: Enable mobile homepage for cswiki and kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518779 (https://phabricator.wikimedia.org/T225676) (owner: 10Kosta Harlan) [16:02:19] (03CR) 10Milimetric: [C: 03+2] Enable sending JS errors to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519243 (https://phabricator.wikimedia.org/T217142) (owner: 10Gergő Tisza) [16:02:24] kostajh, should be at mwdebug1002. Please test and let me know! [16:02:32] (03PS2) 10Milimetric: Enable sending JS errors to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519243 (https://phabricator.wikimedia.org/T217142) (owner: 10Gergő Tisza) [16:02:32] thanks Urbanecm , looking [16:02:44] milimetric, I saw you +2'ed a patch. 
I'm currently deploying SWAT, fyi [16:02:59] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5006.eqsin.wmnet'] ` and were **ALL** successful. [16:03:15] Urbanecm: did so at tgr's request, I think it's supposed to be swatted? [16:04:02] milimetric: normally the reviewer +1s and the SWATter +2s [16:04:07] thanks tgr [16:04:24] sorry, I am so bad at remembering these conventions [16:04:32] tgr, milimetric: Should I SWAT the merged patch to avoid any problems with patches in the way? [16:04:42] mediawiki-config is one of the few exceptions where merging needs to go hand-in-hand with deploying because there are no weekly branches [16:04:44] !log pool cp5006 w/ ATS backend T226477 [16:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:50] T226477: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 [16:04:54] Urbanecm: let's do it :) [16:05:12] Urbanecm: yes, please add it to the SWAT, thx. I'll add it to the page [16:05:20] kostajh, does that mean the homepage patch is working? :) [16:05:30] 10Operations, 10Operations-Software-Development: cumin could use randomization/splay options - https://phabricator.wikimedia.org/T164587 (10crusnov) After looking into this a bit, the details of how this would be done are a bit involved; since internally cumin uses a NodeSet from clustershell, which acts like... 
[16:06:04] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519243 (https://phabricator.wikimedia.org/T217142) (owner: 10Gergő Tisza) [16:06:06] tgr, ok, doing [16:06:26] !log ppchelko@deploy1001 Started deploy [restbase/deploy@a915f69]: Really revert [16:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:39] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes - https://phabricator.wikimedia.org/T226589 (10ema) [16:06:42] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 (10ema) 05Open→03Resolved Done. [16:06:52] (03Merged) 10jenkins-bot: Enable sending JS errors to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519243 (https://phabricator.wikimedia.org/T217142) (owner: 10Gergő Tisza) [16:07:30] tgr, it's live at mwdebug1002, if it can be tested there [16:07:42] (03CR) 10jenkins-bot: Enable sending JS errors to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519243 (https://phabricator.wikimedia.org/T217142) (owner: 10Gergő Tisza) [16:07:59] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:08:00] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@a915f69]: Really revert (duration: 01m 35s) [16:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:27] Urbanecm: When you are finished with the swat, could you let me know? I might have something to deploy. :) [16:08:35] Niharika, sure, will do [16:08:43] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:10:22] kostajh, ping? 
[16:10:42] Urbanecm: sorry, yes, let's deploy [16:10:49] hm, doesn't seem to work, but it's a resourceloader patch so there might be some caching involved [16:10:57] kostajh, thanks, deploying [16:11:24] (03CR) 10Isaac Johnson: [C: 03+1] "Everything looks good to me. Just a note that I'll be leading the deployment of this patch." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519216 (https://phabricator.wikimedia.org/T226273) (owner: 10Bmansurov) [16:11:49] tgr, do you want me to deploy anyway or revert? [16:12:32] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:518779|GrowthExperiments: Enable mobile homepage for cswiki and kowiki]] (T225676) (duration: 00m 56s) [16:12:37] kostajh, deployed! [16:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:39] T225676: Homepage: mobile MVP list - https://phabricator.wikimedia.org/T225676 [16:12:43] duh, I'm being stupid, it's a beta-only patch so of course it can't be tested via mwdebug [16:12:44] thanks Urbanecm [16:12:49] Urbanecm: deploy it please [16:13:16] tgr, ok [16:13:18] yw kostajh [16:14:23] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518390 (https://phabricator.wikimedia.org/T226315) (owner: 10Zoranzoki21) [16:14:35] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518391 (https://phabricator.wikimedia.org/T226315) (owner: 10Zoranzoki21) [16:14:43] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: [[:gerrit:519243|Enable sending JS errors to EventGate]] (T217142) (duration: 00m 55s) [16:14:47] tgr, done [16:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:54] T217142: [WIP] [Proposal] Use the Kafka-Logstash logging infrastructure to log client-side errors - https://phabricator.wikimedia.org/T217142 [16:15:16] (03Merged) 10jenkins-bot: Change name of Serbian Wikinews (part 1) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/518390 (https://phabricator.wikimedia.org/T226315) (owner: 10Zoranzoki21) [16:15:30] Urbanecm: thanks! [16:15:33] yw tgr [16:15:43] !log Pooling restbase1007 back [16:15:45] (03PS3) 10Urbanecm: Change name of Serbian Wikinews in InitialiseSettings.php (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518391 (https://phabricator.wikimedia.org/T226315) (owner: 10Zoranzoki21) [16:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:04] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "SWAT, gate pipeline succeeded, so overriding to save time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518391 (https://phabricator.wikimedia.org/T226315) (owner: 10Zoranzoki21) [16:16:10] (03CR) 10Filippo Giunchedi: "nit line, LGTM otherwise" (031 comment) [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/519068 (owner: 10Cwhite) [16:16:19] (03CR) 10jenkins-bot: Change name of Serbian Wikinews (part 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518390 (https://phabricator.wikimedia.org/T226315) (owner: 10Zoranzoki21) [16:17:51] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1ed, but this should be discussed on the SRE monday meeting as it's an addition of a sudo privilege." 
[puppet] - 10https://gerrit.wikimedia.org/r/517140 (owner: 1020after4) [16:18:10] (03CR) 10jenkins-bot: Change name of Serbian Wikinews in InitialiseSettings.php (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518391 (https://phabricator.wikimedia.org/T226315) (owner: 10Zoranzoki21) [16:18:41] (03PS5) 10Cwhite: branched from tags/v2.0.0 and added debian directory [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/519068 [16:19:04] (03CR) 10Cwhite: [C: 03+2] branched from tags/v2.0.0 and added debian directory (031 comment) [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/519068 (owner: 10Cwhite) [16:19:12] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: [[:gerrit:518390|Change name of Serbian Wikinews (part 1)]] (T226315) (duration: 00m 56s) [16:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:18] T226315: Change name of Serbian Wikinews - https://phabricator.wikimedia.org/T226315 [16:20:21] !log Purged srwikinews.png, srwikinews-1.5x.png, srwikinews-2x.png (T226315) [16:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:54] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:518391|Change name of Serbian Wikinews in InitialiseSettings.php (part 2)]] (T226315) (duration: 00m 55s) [16:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:20] (03PS7) 10Urbanecm: Set default aliases for Project_talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493385 (https://phabricator.wikimedia.org/T173070) (owner: 10Ammarpad) [16:22:34] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493385 (https://phabricator.wikimedia.org/T173070) (owner: 10Ammarpad) [16:23:39] (03Merged) 10jenkins-bot: Set default aliases for Project_talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493385 
(https://phabricator.wikimedia.org/T173070) (owner: 10Ammarpad) [16:23:49] milimetric: so Chrome at least blocks that URL due to mixed content [16:23:53] (03CR) 10jenkins-bot: Set default aliases for Project_talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493385 (https://phabricator.wikimedia.org/T173070) (owner: 10Ammarpad) [16:23:56] can we use HTTPS instead? [16:24:09] tgr: we should ask ottomata [16:24:19] probably, I don't see why not [16:24:28] but I'm not sure if it's set up [16:25:17] (spinning up vagrant to test) [16:25:19] manually sending to https://eventgate-logging.wmflabs.org/v1/events seems to work [16:25:26] so let's do it [16:25:27] oh ok, then yea [16:25:46] !log urbanecm@deploy1001 scap failed: average error rate on 11/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [16:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:54] upps [16:26:22] oops [16:26:58] (03CR) 10Filippo Giunchedi: [C: 03+2] branched from tags/v2.0.0 and added debian directory [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/519068 (owner: 10Cwhite) [16:27:13] it says " Generic connection error", so it might be a temporary problem [16:28:27] (03PS1) 10Gergő Tisza: Fix $wgSentryEventGateUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519249 (https://phabricator.wikimedia.org/T217142) [16:28:39] (03CR) 10Milimetric: [C: 03+1] Fix $wgSentryEventGateUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519249 (https://phabricator.wikimedia.org/T217142) (owner: 10Gergő Tisza) [16:28:51] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reverting change scap had problems with (duration: 00m 55s) [16:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:45] Urbanecm: could you deploy the fix for the previous deploy? 
https://gerrit.wikimedia.org/r/519249 [16:29:52] certainly tgr [16:29:52] (03PS1) 10Urbanecm: Revert "Set default aliases for Project_talk namespace" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519250 (https://phabricator.wikimedia.org/T173070) [16:30:07] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519250 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [16:30:38] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, let's merge and start using it." [puppet] - 10https://gerrit.wikimedia.org/r/517888 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [16:31:03] (03Merged) 10jenkins-bot: Revert "Set default aliases for Project_talk namespace" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519250 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [16:31:14] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519249 (https://phabricator.wikimedia.org/T217142) (owner: 10Gergő Tisza) [16:32:00] (03CR) 10jenkins-bot: Revert "Set default aliases for Project_talk namespace" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519250 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [16:32:07] (03Merged) 10jenkins-bot: Fix $wgSentryEventGateUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519249 (https://phabricator.wikimedia.org/T217142) (owner: 10Gergő Tisza) [16:32:25] I guess the nice way to deploy beta-only patches is by disabling puppet on beta, cherrypicking to beta-deployment and scap pulling to the beta appserver? [16:32:45] (to test beta-only patches, I mean [16:32:54] well, something to figure out the next time [16:33:56] (03CR) 10jenkins-bot: Fix $wgSentryEventGateUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519249 (https://phabricator.wikimedia.org/T217142) (owner: 10Gergő Tisza) [16:35:29] Lucas_WMDE, wondering about the scap failed message. 
Don't see anything relevant in https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040, also it failed with conn failure, not sure what it means [16:36:42] (03CR) 10Jforrester: "It'd have been nice if this had been in CommonSettings-labs.php, given that we must not ever enable this in actual prod pointing at wmflab" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519243 (https://phabricator.wikimedia.org/T217142) (owner: 10Gergő Tisza) [16:37:23] (03CR) 10Urbanecm: "Reverted in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/519250, problems with scap. More details will be in task." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493385 (https://phabricator.wikimedia.org/T173070) (owner: 10Ammarpad) [16:37:54] scap is taking an ethernity... [16:39:15] the Sentry change doesn't need to be scapped, FWIW [16:40:06] (03CR) 10Gehel: [C: 04-1] icinga: fix zero division error for mjolnir bulk update alert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519201 (https://phabricator.wikimedia.org/T225904) (owner: 10Mathew.onipe) [16:40:23] tgr, aha [16:41:53] (03CR) 10Gergő Tisza: "Followup for the wrong URL in I82234f47b." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/519243 (https://phabricator.wikimedia.org/T217142) (owner: 10Gergő Tisza) [16:42:19] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Left some comments, but overall looks fine" (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [16:42:41] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: [[:gerrit:519249|Fix $wgSentryEventGateUri]] (T217142) (duration: 09m 52s) [16:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:47] T217142: [WIP] [Proposal] Use the Kafka-Logstash logging infrastructure to log client-side errors - https://phabricator.wikimedia.org/T217142 [16:43:03] Niharika, I'm done, SWAT is yours [16:44:00] (03PS1) 10RobH: testing out partman for new ganeti hosts [puppet] - 10https://gerrit.wikimedia.org/r/519252 (https://phabricator.wikimedia.org/T226444) [16:44:04] Urbanecm: Thanks. [16:44:09] yw [16:44:12] Urbanecm: what was the error from the canaries? was it the “recursion detected in RequestContext::getLanguage()”? [16:44:28] Lucas_WMDE, I'll paste the output to the task (T173070) [16:44:28] T173070: Set default aliases for Project_talk namespace - https://phabricator.wikimedia.org/T173070 [16:44:39] (03CR) 10RobH: [C: 03+2] testing out partman for new ganeti hosts [puppet] - 10https://gerrit.wikimedia.org/r/519252 (https://phabricator.wikimedia.org/T226444) (owner: 10RobH) [16:45:04] Lucas_WMDE, it's there [16:45:41] wait what [16:45:47] scap wasn’t allowed to *read* the error rate from logstash? [16:45:51] is that what I’m seeing? [16:46:02] yeah :) [16:46:11] no, wait, not necessarily “not allowed”, just “not able” [16:46:25] (misunderstood the “max retries” for a server-imposed rate limit) [16:46:29] Might have been a service blip on the logstash service? [16:47:01] It's an infrastructure error, not an appserver error. 
[16:47:21] It'd be really impressive (but, sadly, not impossible) that a MediaWiki appserver config change would break logstash. [16:47:33] Welcome to the Wikimedia code Hellmouth. [16:48:57] (03PS1) 10Gergő Tisza: Improve Sentry config organization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519253 [16:49:34] (03PS2) 10Gergő Tisza: Improve Sentry config organization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519253 [16:49:57] 10Operations, 10netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10ayounsi) Thanks for the quick replies, it passes a commit check, will push the following shortly. `lang=diff [edit policy-options policy-statement BGP_sanitize_in then] community delete AS14907:ALL { ... } + commun... [16:50:39] (03CR) 10Gergő Tisza: "> It'd have been nice if this had been in CommonSettings-labs.php, given that we must not ever enable this in actual prod pointing at wmfl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519243 (https://phabricator.wikimedia.org/T217142) (owner: 10Gergő Tisza) [16:53:18] 10Operations, 10netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10JobSnijders) These are non-transitive extended communities. They can not cross an EBGP boundary, the deletion in `policy-statement BGP_sanitize_in` is perhaps superfluous. [16:53:31] (03CR) 10Jforrester: "> Patch Set 2:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/517557 (https://phabricator.wikimedia.org/T224935) (owner: 10Jeena Huneidi) [16:55:48] milimetric: the errors are received by EventGate now. I don't see anything in beta logstash, is some piece still missing for that? 
[16:58:02] (03PS1) 10Urbanecm: Revert "Revert "Set default aliases for Project_talk namespace"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519257 [16:58:21] (03PS2) 10Urbanecm: Revert "Revert "Set default aliases for Project_talk namespace"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519257 (https://phabricator.wikimedia.org/T173070) [16:58:49] tgr yeah godog said he was going to set that up [16:59:02] yeah likely ingesting from the right kafka topic [16:59:12] I have to go shortly but I'll hook that up tomorrow [16:59:19] ottomata: what's the kafka topic(s) ? [17:00:17] (03CR) 10Urbanecm: [C: 04-1] "cswiki and kowiki was already done in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/518779" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519082 (https://phabricator.wikimedia.org/T215983) (owner: 10Catrope) [17:01:26] godog: eqiad.client.error [17:01:45] (03PS3) 10Alexandros Kosiaris: ganeti: Setup buster and a software RAID5 recipe [puppet] - 10https://gerrit.wikimedia.org/r/519075 (https://phabricator.wikimedia.org/T224603) [17:02:00] tgr: there's no dt timestamp in the data? [17:02:04] ottomata: kk thank you [17:02:07] gotta go [17:02:23] (03CR) 10Alexandros Kosiaris: [C: 03+2] ganeti: Setup buster and a software RAID5 recipe [puppet] - 10https://gerrit.wikimedia.org/r/519075 (https://phabricator.wikimedia.org/T224603) (owner: 10Alexandros Kosiaris) [17:03:22] ottomata: should there be? this is the raw Sentry response, with only $schema and meta added [17:06:23] PROBLEM - puppet last run on dns2002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [17:06:39] tgr aye [17:06:56] dunno, it seems like there should be, otherwise how are you going to know when the error happened? 
[17:08:32] by adding a timestamp on the entry point, presumably [17:08:56] doing it on client side means you have to rely on the client's clock being accurate [17:09:14] which often isn't the case [17:19:33] (03PS1) 10Bstorm: toolforge: configure kubernetes node using TLS instead of token auth [puppet] - 10https://gerrit.wikimedia.org/r/519259 (https://phabricator.wikimedia.org/T215531) [17:28:49] (03PS1) 10Elukey: profile::hadoop::common: set r+o to the trustore file [puppet] - 10https://gerrit.wikimedia.org/r/519263 (https://phabricator.wikimedia.org/T212259) [17:28:50] (03PS28) 10Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [17:28:52] (03CR) 10Elukey: [C: 03+2] profile::hadoop::common: set r+o to the trustore file [puppet] - 10https://gerrit.wikimedia.org/r/519263 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [17:29:18] (03CR) 10Ayounsi: "> Patch Set 27: Code-Review-1" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [17:37:07] (03PS3) 10Mathew.onipe: icinga: fix zero division error for mjolnir bulk update alert [puppet] - 10https://gerrit.wikimedia.org/r/519201 (https://phabricator.wikimedia.org/T225904) [17:37:38] (03CR) 10Mathew.onipe: icinga: fix zero division error for mjolnir bulk update alert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519201 (https://phabricator.wikimedia.org/T225904) (owner: 10Mathew.onipe) [17:39:05] RECOVERY - puppet last run on dns2002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [17:43:03] (03PS5) 10Jcrespo: prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) [17:43:54] (03CR) 10jerkins-bot: [V: 04-1] prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: 
10Jcrespo) [17:44:23] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stats machines hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Nuria) [17:44:39] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Nuria) [17:45:48] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Nuria) I think @ACraze needs to be added also to NDA group so he can gain access to analytics tools such as turnilo/superset [17:47:42] (03PS6) 10Jcrespo: prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) [17:47:46] (03PS1) 10RobH: Revert "testing out partman for new ganeti hosts" [puppet] - 10https://gerrit.wikimedia.org/r/519264 [17:48:06] (03CR) 10jerkins-bot: [V: 04-1] Revert "testing out partman for new ganeti hosts" [puppet] - 10https://gerrit.wikimedia.org/r/519264 (owner: 10RobH) [17:48:36] (03CR) 10jerkins-bot: [V: 04-1] prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [17:50:56] (03PS2) 10RobH: Revert "testing out partman for new ganeti hosts" [puppet] - 10https://gerrit.wikimedia.org/r/519264 [17:51:45] (03CR) 10RobH: [C: 03+2] Revert "testing out partman for new ganeti hosts" [puppet] - 10https://gerrit.wikimedia.org/r/519264 (owner: 10RobH) [17:52:09] !log finished migration of kafka2002 to kafka-main2002 — enabling alert notifications for kafka-main2002, and leaving kafka2002 disabled T225005 [17:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:15] T225005: Replace and expand codfw 
kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 [17:52:26] (03PS1) 10Herron: Revert "kafka-main2002 disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/519265 [17:53:02] (03PS2) 10Jcrespo: mariadb: Prepare core for buster [puppet] - 10https://gerrit.wikimedia.org/r/519073 (https://phabricator.wikimedia.org/T193224) [17:53:05] (03PS7) 10Jcrespo: prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) [17:53:40] (03CR) 10jerkins-bot: [V: 04-1] prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [17:54:52] (03PS1) 10Jforrester: [DNM] CI verification commit [deployment-charts] - 10https://gerrit.wikimedia.org/r/519266 [17:56:28] (03PS8) 10Jcrespo: prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) [17:57:13] (03CR) 10jerkins-bot: [V: 04-1] prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [17:57:57] (03Abandoned) 10CDanis: check_prometheus: allow non-grafana links in $dashboard_links [puppet] - 10https://gerrit.wikimedia.org/r/508748 (owner: 10CDanis) [17:58:17] (03CR) 10Jforrester: "check experimental" [deployment-charts] - 10https://gerrit.wikimedia.org/r/519266 (owner: 10Jforrester) [17:58:35] (03CR) 10Jbond: Bird anycast: add anycast_healthchecker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [17:58:37] (03Abandoned) 10CDanis: mtail::program notify Service['mtail'] by default [puppet] - 10https://gerrit.wikimedia.org/r/478669 (owner: 10CDanis) [18:00:04] Deploy window Pre MediaWiki train sanity 
break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190626T1800) [18:03:12] (03PS9) 10Jcrespo: prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) [18:04:04] (03PS29) 10Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [18:04:08] (03CR) 10jerkins-bot: [V: 04-1] prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [18:06:26] (03PS10) 10Jcrespo: prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) [18:07:15] (03PS1) 10Andrew Bogott: nova-compute: consolidate a bunch of code that isn't distro-specific [puppet] - 10https://gerrit.wikimedia.org/r/519268 [18:07:18] (03CR) 10Ayounsi: "Thanks, compiler is now happy: https://puppet-compiler.wmflabs.org/compiler1002/17127/dns1001.wikimedia.org/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [18:10:23] (03PS2) 10Herron: Revert "kafka-main2002 disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/519265 [18:11:41] (03CR) 10Herron: [C: 03+2] Revert "kafka-main2002 disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/519265 (owner: 10Herron) [18:11:47] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [18:11:57] (03PS2) 10Andrew Bogott: nova-compute: consolidate a bunch of code that isn't distro-specific [puppet] - 10https://gerrit.wikimedia.org/r/519268 [18:18:08] (03CR) 10Jcrespo: "This is not perfect, but it should be simple enough to get what I want to do (*not finished, requires secret handling*)." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [18:27:57] (03PS2) 10Bstorm: toolforge: configure kubernetes node using TLS instead of token auth [puppet] - 10https://gerrit.wikimedia.org/r/519259 (https://phabricator.wikimedia.org/T215531) [18:32:10] (03CR) 10BBlack: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [18:32:15] (03Abandoned) 10Catrope: Enable mobile homepage on cswiki, kowiki, viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519082 (https://phabricator.wikimedia.org/T215983) (owner: 10Catrope) [18:32:30] (03PS2) 10Catrope: Enable GrowthExperiments homepage on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519080 (https://phabricator.wikimedia.org/T218237) [18:34:39] (03PS1) 10Herron: kafka2002 move to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/519271 (https://phabricator.wikimedia.org/T225005) [18:37:32] (03CR) 10Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1001/17130/" [puppet] - 10https://gerrit.wikimedia.org/r/519271 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [18:40:05] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:41:39] (03CR) 10Herron: [C: 03+2] kafka2002 move to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/519271 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [18:44:40] 10Operations: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10RobH) a:05RobH→03None [18:45:23] 10Operations: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10RobH) a:03akosiaris Assigning to Alex for ganeti setup. Not 100% if this is an Alex project (due to ganeti) or a #traffic project (due to being in our caching centers), but going with Alex first. 
[18:48:22] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team Backlog (Watching / External), and 3 others: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 (10herron) [18:50:11] 10Operations, 10Traffic: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10BBlack) a:05akosiaris→03None I don't think anyone's 100% sure how we're handling this project, but probably Traffic will figure out the setup for these and ask Alex if we need help. We probably won't... [18:54:16] PROBLEM - High lag on wdqs1003 is CRITICAL: 3663 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [18:56:08] PROBLEM - Nginx local proxy to apache on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:57:10] RECOVERY - Nginx local proxy to apache on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:59:22] 10Operations, 10SRE-Access-Requests: Request access to analytics cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223697 (10alaa_wmde) 05Open→03Resolved a:03alaa_wmde yeap confirmed.. I can access Turnilo now. Resolving it for now, and a new one for access to hadoop would be created if needed... [19:00:04] longma: That opportune time is upon us again. Time for a MediaWiki train - American version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190626T1900). 
[19:03:26] (03PS1) 10Herron: kafka-main: replace kafka2001 hardware with kafka-main2001 [puppet] - 10https://gerrit.wikimedia.org/r/519273 (https://phabricator.wikimedia.org/T225005) [19:03:40] (03PS1) 10Jeena Huneidi: group1 wikis to 1.34.0-wmf.11 refs T220736 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519274 [19:03:43] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.34.0-wmf.11 refs T220736 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519274 (owner: 10Jeena Huneidi) [19:04:41] 10Operations, 10Wikimedia-Mailing-lists: Create MoveCom mailing list for Movement communications group - https://phabricator.wikimedia.org/T218367 (10jbond) @Varnent sorry for the delay on our side, I'm just picking this up and should be able to complete it tomorrow. Just wanted to confirm that the ComCom maili... [19:04:50] (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.11 refs T220736 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519274 (owner: 10Jeena Huneidi) [19:04:57] (03PS3) 10Andrew Bogott: nova-compute: consolidate a bunch of code that isn't distro-specific [puppet] - 10https://gerrit.wikimedia.org/r/519268 [19:04:59] (03PS1) 10Andrew Bogott: nova-compute: remove libvirt_type params [puppet] - 10https://gerrit.wikimedia.org/r/519275 [19:05:01] (03PS1) 10Andrew Bogott: nova-compute: use puppet certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) [19:05:04] longma: I'm here to help if you need anything, looks like you've got it though [19:05:23] oh thanks. 
I am afraid :P [19:05:58] (03CR) 10jerkins-bot: [V: 04-1] nova-compute: remove libvirt_type params [puppet] - 10https://gerrit.wikimedia.org/r/519275 (owner: 10Andrew Bogott) [19:06:15] (03CR) 10jerkins-bot: [V: 04-1] nova-compute: use puppet certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) (owner: 10Andrew Bogott) [19:06:38] (03CR) 10jenkins-bot: group1 wikis to 1.34.0-wmf.11 refs T220736 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519274 (owner: 10Jeena Huneidi) [19:06:45] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.11 refs T220736 [19:07:41] !log jhuneidi@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.11 refs T220736 (duration: 00m 56s) [19:07:42] PROBLEM - puppet last run on cloudstore1009 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [19:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:41] T220736: 1.34.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T220736 [19:08:49] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Halfak) I don't believe that "nda" is a unix group. He has already been added to the nda ldap group. See {T225956} [19:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:57] longma, don't be afraid. one of the foundation values is to be italic. [19:10:42] (03PS3) 10Catrope: Enable GrowthExperiments homepage on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519080 (https://phabricator.wikimedia.org/T218237) [19:11:22] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:11:29] (03PS2) 10Catrope: Enable GrowthExperiments homepage for 50% of new users on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519081 [19:12:06] i'm not sure what that means but thanks liw! [19:12:31] longma, it's actually "be bold", but I was trying to be funny [19:12:49] :D [19:17:50] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:18:41] (03PS4) 10Andrew Bogott: nova-compute: consolidate a bunch of code that isn't distro-specific [puppet] - 10https://gerrit.wikimedia.org/r/519268 [19:18:43] (03PS2) 10Andrew Bogott: nova-compute: remove libvirt_type params [puppet] - 10https://gerrit.wikimedia.org/r/519275 [19:18:45] (03PS2) 10Andrew Bogott: nova-compute: use puppet certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) [19:19:54] (03CR) 10jerkins-bot: [V: 04-1] nova-compute: use puppet certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) (owner: 10Andrew Bogott) [19:24:16] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:24:38] (03CR) 10Jforrester: "Thank you." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/519253 (owner: 10Gergő Tisza) [19:25:11] (03PS3) 10Andrew Bogott: nova-compute: use puppet certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) [19:29:57] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Nuria) let's see: @ACraze can you access http://turnilo.wikimedia.org [19:38:55] (03PS2) 10Ottomata: Add 'Z' suffix to webrequest log dt format [puppet] - 10https://gerrit.wikimedia.org/r/516528 (https://phabricator.wikimedia.org/T217040) [19:39:04] RECOVERY - puppet last run on cloudstore1009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:39:06] (03CR) 10Jhedden: "Could we reuse `base::expose_puppet_certs` for this? Few other comments, but overall it looks good." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) (owner: 10Andrew Bogott) [19:42:03] !log file-read-backwards v2.0.0 deployed to apt repo [19:42:03] (03PS4) 10Andrew Bogott: nova-compute: use puppet certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) [19:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:58] (03PS5) 10Andrew Bogott: nova-compute: use puppet certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) [19:45:19] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519080 (https://phabricator.wikimedia.org/T218237) (owner: 10Catrope) [19:45:47] (03CR) 10Andrew Bogott: nova-compute: use puppet certs for libvirt (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) (owner: 10Andrew Bogott) [19:45:56] (03CR) 
10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519081 (owner: 10Catrope) [19:49:03] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10jbond) >>! In T226204#5286992, @Halfak wrote: > I don't believe that "nda" is a unix group. He has already been added to... [19:49:17] (03CR) 10Cwhite: "> Patch Set 9:" (031 comment) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [19:50:33] (03CR) 10Ottomata: [C: 03+1] Production shell: create shell account for accraze [puppet] - 10https://gerrit.wikimedia.org/r/518953 (https://phabricator.wikimedia.org/T226204) (owner: 10Jbond) [19:50:52] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Ottomata) +1 given! [19:52:35] (03CR) 10Jbond: [C: 03+2] Production shell: create shell account for accraze [puppet] - 10https://gerrit.wikimedia.org/r/518953 (https://phabricator.wikimedia.org/T226204) (owner: 10Jbond) [19:52:44] (03PS2) 10Jbond: Production shell: create shell account for accraze [puppet] - 10https://gerrit.wikimedia.org/r/518953 (https://phabricator.wikimedia.org/T226204) [20:00:04] cscott, arlolra, subbu, bearND, and halfak: It is that lovely time of the day again! You are hereby commanded to deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190626T2000). [20:00:33] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10jbond) thanks @Ottomata i have merged now will wait for confirmation that all access is enabled before closing. 
>>! I... [20:01:45] (03PS5) 10Elukey: analytics::refinery::job::data_purge add deletion for data_quality_hourly [puppet] - 10https://gerrit.wikimedia.org/r/518069 (https://phabricator.wikimedia.org/T215863) (owner: 10Mforns) [20:12:07] (03PS6) 10Andrew Bogott: nova-compute: use puppet certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) [20:13:00] (03CR) 10jerkins-bot: [V: 04-1] nova-compute: use puppet certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) (owner: 10Andrew Bogott) [20:13:35] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/17135/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/518069 (https://phabricator.wikimedia.org/T215863) (owner: 10Mforns) [20:19:11] (03PS7) 10Andrew Bogott: nova-compute: use puppet certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) [20:30:37] ...about to start parsoid deploy to production... 
[20:33:07] (03PS3) 10Bstorm: toolforge: configure kubernetes node using TLS instead of token auth [puppet] - 10https://gerrit.wikimedia.org/r/519259 (https://phabricator.wikimedia.org/T215531) [20:35:14] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@41a86f8]: Merge "Update prod config template to pass thru accept-language to the MW API" [20:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:19] !log cscott@deploy1001 Started deploy [parsoid/deploy@3d20703]: Updating Parsoid to 31d356a5 (ensure proper source texts when parsing) [20:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:11] Noting for posterity - added cparle to wmf-deployment group on Gerrit (already has access to the cluster and deploy rights there, just needed to be able to merge config patches) [20:37:30] !log bsitzmann@deploy1001 deploy aborted: Merge "Update prod config template to pass thru accept-language to the MW API" (duration: 02m 15s) [20:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:58] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@41a86f8]: Merge "Update prod config template to pass thru accept-language to the MW API" [20:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:50] jbond42 test [20:47:53] (03PS1) 10Volans: dbconfig: fix schema validation [software/conftool] - 10https://gerrit.wikimedia.org/r/519282 [20:47:55] (03PS1) 10Volans: dbconfig: include example in instance edit [software/conftool] - 10https://gerrit.wikimedia.org/r/519283 [20:48:15] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@41a86f8]: Merge "Update prod config template to pass thru accept-language to the MW API" (duration: 03m 17s) [20:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:38] (03PS1) 10CDanis: dbctl: respect --scope in config diff, config commit [software/conftool] - 
10https://gerrit.wikimedia.org/r/519284 [20:49:27] (03CR) 10CDanis: [C: 03+2] dbconfig: fix schema validation [software/conftool] - 10https://gerrit.wikimedia.org/r/519282 (owner: 10Volans) [20:49:59] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@85fc707]: Update mobileapps to 4f9b376 [20:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:30] 10Operations, 10ops-eqiad: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T226599 (10Volans) @elukey you can disable icinga notification on a cluster via hiera. Alternatively to disable only this kind of check you can disable event handler from Icinga UI. I'm not sure if we have any more f... [20:51:56] (03Merged) 10jenkins-bot: dbconfig: fix schema validation [software/conftool] - 10https://gerrit.wikimedia.org/r/519282 (owner: 10Volans) [20:52:07] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@85fc707]: Update mobileapps to 4f9b376 (duration: 02m 08s) [20:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:58] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1168 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [20:55:40] (03CR) 10CDanis: [C: 03+2] dbconfig: include example in instance edit [software/conftool] - 10https://gerrit.wikimedia.org/r/519283 (owner: 10Volans) [20:56:13] !log cscott@deploy1001 Finished deploy [parsoid/deploy@3d20703]: Updating Parsoid to 31d356a5 (ensure proper source texts when parsing) (duration: 20m 55s) [20:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:55] (03Merged) 10jenkins-bot: dbconfig: include example in instance edit [software/conftool] - 10https://gerrit.wikimedia.org/r/519283 (owner: 10Volans) [20:58:38] !log added cparle to wmf-deployment group on Gerrit (already has deploy access) [20:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[21:00:04] matthiasmullie and cormacparle__: #bothumor My software never has bugs. It just develops random features. Rise for Structured Data on Commons. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190626T2100). [21:02:38] (03PS2) 10Cparle: [SDC] Enable other statements on testcommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519047 (owner: 10Matthias Mullie) [21:02:52] (03CR) 10Cparle: [C: 03+2] [SDC] Enable other statements on testcommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519047 (owner: 10Matthias Mullie) [21:03:48] (03Merged) 10jenkins-bot: [SDC] Enable other statements on testcommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519047 (owner: 10Matthias Mullie) [21:04:02] (03CR) 10jenkins-bot: [SDC] Enable other statements on testcommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519047 (owner: 10Matthias Mullie) [21:09:42] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Nuria) @jbond for turnilo i believe wmf-nda is needed [21:15:31] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10jbond) >>! In T226204#5287348, @Nuria wrote: > @jbond for turnilo i believe wmf-nda is needed ack, ill double check with @MoritzMuehlenhoff [21:16:03] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10ACraze) @Nuria I'm not able to access http://turnilo.wikimedia.org [21:17:19] (03CR) 10Volans: [C: 04-1] "LGTM, one nit inline. But this depends on a new release of spicerack (hence the -1, just as a reminder)." 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/519244 (owner: 10CRusnov) [21:22:25] !log cparle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SDC: Enable other statements on test commons (duration: 00m 58s) [21:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:17] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10MoritzMuehlenhoff) Adding to cn=wmf is fine and can always be done right away (the staff status covers the NDA angle). Adding to the statistics... [21:24:40] (03PS2) 10Volans: dbctl: respect --scope in config diff, config commit [software/conftool] - 10https://gerrit.wikimedia.org/r/519284 (owner: 10CDanis) [21:25:20] Our deployment is done, the rest of the slot is available for anyone who wants it [21:26:36] Congrats cormacparle__! (first deployment ever!) [21:28:03] cormacparle__: Woo. [21:28:06] (03CR) 10Jforrester: [C: 03+2] Improve Sentry config organization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519253 (owner: 10Gergő Tisza) [21:28:19] cormacparle__, marktraceur: I played around on https://test-commons.wikimedia.org/wiki/File:Godward_Idleness_1900-dupe!.jpg ;-) [21:29:16] (03CR) 10Volans: [C: 03+2] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/519284 (owner: 10CDanis) [21:31:48] (03PS3) 10Jforrester: Improve Sentry config organization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519253 (owner: 10Gergő Tisza) [21:31:52] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519253 (owner: 10Gergő Tisza) [21:32:07] (03Merged) 10jenkins-bot: dbctl: respect --scope in config diff, config commit [software/conftool] - 10https://gerrit.wikimedia.org/r/519284 (owner: 10CDanis) [21:32:57] (03Merged) 10jenkins-bot: Improve Sentry config organization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519253 
(owner: 10Gergő Tisza) [21:33:12] (03CR) 10jenkins-bot: Improve Sentry config organization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519253 (owner: 10Gergő Tisza) [21:35:01] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Explicitly set wgSentryEventGateUri to false in prod IS (duration: 00m 56s) [21:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:47] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Don't set wgSentryEventGateUri in prod CS (duration: 00m 55s) [21:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:39] (03CR) 10Smalyshev: [C: 03+1] wdqs: publish full MDC in file based logs. [puppet] - 10https://gerrit.wikimedia.org/r/519046 (owner: 10Gehel) [22:08:20] (03PS1) 10Volans: dbconfig: improve help message for --group option [software/conftool] - 10https://gerrit.wikimedia.org/r/519300 [22:08:22] (03PS1) 10Volans: dbconfig: add support for 'instance all get' [software/conftool] - 10https://gerrit.wikimedia.org/r/519301 [22:09:17] (03CR) 10CDanis: dbconfig: improve help message for --group option (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/519300 (owner: 10Volans) [22:11:11] (03PS2) 10Volans: dbconfig: improve help message for --group option [software/conftool] - 10https://gerrit.wikimedia.org/r/519300 [22:11:13] (03PS2) 10Volans: dbconfig: add support for 'instance all get' [software/conftool] - 10https://gerrit.wikimedia.org/r/519301 [22:13:35] (03CR) 10CDanis: [C: 03+2] dbconfig: add support for 'instance all get' [software/conftool] - 10https://gerrit.wikimedia.org/r/519301 (owner: 10Volans) [22:13:53] (03CR) 10CDanis: [C: 03+2] dbconfig: improve help message for --group option [software/conftool] - 10https://gerrit.wikimedia.org/r/519300 (owner: 10Volans) [22:16:45] (03Merged) 10jenkins-bot: dbconfig: improve help message for --group option [software/conftool] - 
10https://gerrit.wikimedia.org/r/519300 (owner: 10Volans) [22:16:45] (03Merged) 10jenkins-bot: dbconfig: add support for 'instance all get' [software/conftool] - 10https://gerrit.wikimedia.org/r/519301 (owner: 10Volans) [22:19:49] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10Halfak) Thanks all :) [22:28:12] 10Operations, 10Dumps-Generation, 10hardware-requests: consider getting a third dumpsdata server - https://phabricator.wikimedia.org/T219768 (10RobH) [22:41:22] (03PS2) 10Ppchelko: Clean up configuration for pdfrender service. [puppet] - 10https://gerrit.wikimedia.org/r/514226 (https://phabricator.wikimedia.org/T226675) [22:42:48] (03CR) 10Ppchelko: "I was not entirely sure how to undeploy something, so I just completely cleaned up puppet of everything related to pdfrender service." [puppet] - 10https://gerrit.wikimedia.org/r/514226 (https://phabricator.wikimedia.org/T226675) (owner: 10Ppchelko) [22:43:31] (03CR) 10Mobrovac: [C: 03+1] Clean up configuration for pdfrender service. [puppet] - 10https://gerrit.wikimedia.org/r/514226 (https://phabricator.wikimedia.org/T226675) (owner: 10Ppchelko) [22:50:40] (03PS8) 10Andrew Bogott: nova-compute: use puppet certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) [22:51:32] (03CR) 10jerkins-bot: [V: 04-1] nova-compute: use puppet certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) (owner: 10Andrew Bogott) [23:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190626T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:00:26] I have a patch, but it's still merging into master
[23:00:31] I'll deploy it myself when it's ready
[23:12:53] Operations, Diffusion, Packaging, Release-Engineering-Team (Development services), and 2 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (greg)
[23:13:40] Operations, Diffusion, Packaging, Release-Engineering-Team (Development services), and 3 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (greg)
[23:16:13] (PS9) Andrew Bogott: nova-compute: use puppet certs for libvirt [puppet] - https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484)
[23:17:02] (CR) jerkins-bot: [V: -1] nova-compute: use puppet certs for libvirt [puppet] - https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) (owner: Andrew Bogott)
[23:18:42] (PS10) Andrew Bogott: nova-compute: use puppet certs for libvirt [puppet] - https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484)
[23:19:13] (CR) jerkins-bot: [V: -1] nova-compute: use puppet certs for libvirt [puppet] - https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) (owner: Andrew Bogott)
[23:21:41] (PS11) Andrew Bogott: nova-compute: use puppet certs for libvirt [puppet] - https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484)
[23:22:13] (CR) jerkins-bot: [V: -1] nova-compute: use puppet certs for libvirt [puppet] - https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) (owner: Andrew Bogott)
[23:24:19] (PS12) Andrew Bogott: nova-compute: use puppet certs for libvirt [puppet] - https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484)
[23:30:56] (PS1) Andrew Bogott: libvirtd: turn on listening
[puppet] - https://gerrit.wikimedia.org/r/519315
[23:38:19] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/Echo/modules/nojs/mw.echo.badge.monobook.less: Fix horizontal scrollbars in Monobook (T226594) (duration: 00m 57s)
[23:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:26] T226594: Wiki pages are very wide in Monobook for logged in users - https://phabricator.wikimedia.org/T226594
[23:39:39] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.10/extensions/Echo/modules/nojs/mw.echo.badge.monobook.less: Fix horizontal scrollbars in Monobook (T226594) (duration: 00m 55s)
[23:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log