[00:12:53] and on a weekend 😒 [01:35:51] PROBLEM - HHVM jobrunner on mw1304 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [01:37:19] RECOVERY - HHVM jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:24:49] hmm I thought the bigdelete jobqueue was supposed to say it's queued to JQ instead of errorz [02:25:24] got an error but deleted so nothing to worry in the end :-) [03:17:19] PROBLEM - HHVM jobrunner on mw2250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:17:39] PROBLEM - MD RAID on mw2250 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [03:17:41] ACKNOWLEDGEMENT - MD RAID on mw2250 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T226948 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [03:17:45] 10Operations, 10ops-codfw: Degraded RAID on mw2250 - https://phabricator.wikimedia.org/T226948 (10ops-monitoring-bot) [03:18:39] RECOVERY - HHVM jobrunner on mw2250 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:55:40] 10Operations, 10observability: Upgrade grafana to 6.x - https://phabricator.wikimedia.org/T220838 (10bd808) > It's also worth looking at testing the upgrade on grafana-labs if they're interested That would be great!
And it would help address {T226108} [04:39:59] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1072 is CRITICAL: cluster=mysql device=megaraid,9 instance=db1072:9100 job=node site=eqiad Marostegui T208323 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1072&var-datasource=eqiad+prometheus/ops [04:40:23] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [04:43:47] (03PS1) 10Marostegui: dbproxy: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/519948 (https://phabricator.wikimedia.org/T222978) [04:44:42] (03CR) 10Marostegui: [C: 03+2] dbproxy: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/519948 (https://phabricator.wikimedia.org/T222978) (owner: 10Marostegui) [04:49:01] !log Change pt-kill value on labsdb1009 temporarily, from 300 to 14400 T222978 [04:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:06] T222978: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 [04:50:41] !log Reload haproxy on dbproxy1010 and dbproxy1011 to depool labsdb1011 - T222978 [04:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:19] * marostegui !log Keep compressing tables on labsdb1011 - T222978 [04:53:29] !log Keep compressing tables on labsdb1011 - T222978 [04:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:09] (03PS2) 10Marostegui: db-eqiad.php: Promote db1120 to x1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519187 (https://phabricator.wikimedia.org/T226358) [05:20:10] (03PS3) 10Marostegui: wmnet: Change x1-master to point to the new master [dns] - 10https://gerrit.wikimedia.org/r/519186 (https://phabricator.wikimedia.org/T226358) [05:20:12] (03PS2) 10Marostegui: mariadb: Promote db1120 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/519185 
(https://phabricator.wikimedia.org/T226358) [05:25:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1068 - https://phabricator.wikimedia.org/T226689 (10Marostegui) [05:30:37] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [05:42:26] (03PS2) 10Ema: vcl: remove Vary:AL workaround for fixcopyright.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/519606 (https://phabricator.wikimedia.org/T203179) [05:43:17] (03CR) 10Ema: [C: 03+2] vcl: remove Vary:AL workaround for fixcopyright.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/519606 (https://phabricator.wikimedia.org/T203179) (owner: 10Ema) [05:57:39] (03PS1) 10Ema: cache: reimage cp2014 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519949 (https://phabricator.wikimedia.org/T226637) [05:57:59] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:06:13] PROBLEM - puppet last run on cp2012 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[load-new-vcl-file] [06:11:41] RECOVERY - puppet last run on cp2012 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:13:54] (03CR) 10Vgutierrez: [C: 03+1] cache: reimage cp2014 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519949 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema) [06:16:19] !log depool cp2014 and reimage as upload_ats T226637 [06:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:25] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [06:16:40] 10Operations, 10DBA, 10OTRS, 10Operations-Software-Development, 10Recommendation-API: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (10Marostegui) [06:16:50] 10Operations, 10DBA, 10OTRS, 10Operations-Software-Development, 10Recommendation-API: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (10Marostegui) p:05Triage→03Normal [06:17:24] (03CR) 10Ema: [C: 03+2] cache: reimage cp2014 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519949 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema) [06:20:35] (03PS1) 10ArielGlenn: defer start of July 1 2019 dumps until evening [puppet] - 10https://gerrit.wikimedia.org/r/519952 [06:23:55] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2014.codfw.wmnet'] ` The log can be found in `...
[06:28:33] (03PS2) 10ArielGlenn: defer start of July 1 2019 dumps until evening [puppet] - 10https://gerrit.wikimedia.org/r/519952 [06:30:15] (03CR) 10ArielGlenn: [C: 03+2] defer start of July 1 2019 dumps until evening [puppet] - 10https://gerrit.wikimedia.org/r/519952 (owner: 10ArielGlenn) [06:32:35] PROBLEM - puppet last run on mw1314 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [06:33:47] (03CR) 10Ema: [C: 03+1] grafana: update varnish-aggregate-client-status-codes to prometheus version [puppet] - 10https://gerrit.wikimedia.org/r/519664 (https://phabricator.wikimedia.org/T184942) (owner: 10Cwhite) [06:33:55] PROBLEM - puppet last run on elastic1045 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [06:50:54] (03PS7) 10Giuseppe Lavagetto: mediawiki::php: add a fatal error page to go with the proposed wmerrors feature [puppet] - 10https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) (owner: 10Tim Starling) [06:51:24] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php: add a fatal error page to go with the proposed wmerrors feature [puppet] - 10https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) (owner: 10Tim Starling) [06:53:19] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2014.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2014.codfw.wmnet'] ` [06:58:48] RECOVERY - puppet last run on mw1314 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:00:08] RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:03:03] (03PS1) 10Elukey: Replace Yarn nodemanager 
Prometheus exp's host:port to allow IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/519957 (https://phabricator.wikimedia.org/T225296) [07:04:16] !log pool cp2014 w/ ATS backend T226637 [07:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:21] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [07:18:48] 10Operations, 10fundraising-tech-ops: Authentication for grafana - https://phabricator.wikimedia.org/T198648 (10Aklapper) a:05cwdent→03None (Resetting assignee as @cwdent has left WMF) [07:18:51] 10Operations, 10fundraising-tech-ops: Long term storage for frack prometheus data - https://phabricator.wikimedia.org/T175738 (10Aklapper) a:05cwdent→03None (Resetting assignee as @cwdent has left WMF) [07:18:54] 10Operations, 10fundraising-tech-ops: rack/setup/install Prometeuse/Grafana host frmon2001 for fr-tech - https://phabricator.wikimedia.org/T196476 (10Aklapper) a:05cwdent→03None (Resetting assignee as @cwdent has left WMF) [07:21:27] (03PS1) 10Muehlenhoff: Remove access for bmansurov [puppet] - 10https://gerrit.wikimedia.org/r/519958 [07:23:58] (03PS2) 10Muehlenhoff: Remove access for bmansurov [puppet] - 10https://gerrit.wikimedia.org/r/519958 [07:25:28] 10Operations, 10DBA, 10OTRS, 10Operations-Software-Development, 10Recommendation-API: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (10MoritzMuehlenhoff) debmonitor readonly time is not an issue, the debmonitor clients will simply retry the next time.
[07:25:45] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for bmansurov [puppet] - 10https://gerrit.wikimedia.org/r/519958 (owner: 10Muehlenhoff) [07:26:04] 10Operations, 10Analytics, 10Analytics-Kanban, 10vm-requests, 10User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (10elukey) Current status is: ` elukey@ganeti1001:~$ sudo gnt-group list Group Nodes Instances AllocP... [07:29:59] 10Operations, 10Analytics: Reduce memory allocation for kafkamon instances - https://phabricator.wikimedia.org/T224988 (10elukey) I would go down to 4G with (on ganeti1001): ` sudo gnt-instance modify -B memory=4g kafkamon1001.eqiad.wmnet ` Same thing for the codfw instance. From grafana it seems that we co... [07:30:21] 10Operations, 10Analytics, 10Analytics-Kanban: Reduce memory allocation for kafkamon instances - https://phabricator.wikimedia.org/T224988 (10elukey) a:03elukey [07:41:15] (03PS1) 10Giuseppe Lavagetto: wmerrors: enable on the mwdebug servers [puppet] - 10https://gerrit.wikimedia.org/r/519961 (https://phabricator.wikimedia.org/T187147) [07:42:06] (03CR) 10Giuseppe Lavagetto: [C: 03+2] wmerrors: enable on the mwdebug servers [puppet] - 10https://gerrit.wikimedia.org/r/519961 (https://phabricator.wikimedia.org/T187147) (owner: 10Giuseppe Lavagetto) [07:43:04] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): User[bmansurov] [07:43:56] 10Operations, 10observability, 10Performance-Team (Radar), 10User-Elukey: Consider adding per-shard metrics to the prometheus mcrouter exporter - https://phabricator.wikimedia.org/T225059 (10elukey) @fgiunchedi let me know if the above new metrics (and code if you have time - https://github.com/Dev25/mcrou... 
[07:52:03] 10Operations, 10DBA, 10OTRS, 10Operations-Software-Development, 10Recommendation-API: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (10jcrespo) Because of the TTL mention, are you planning a failover of proxy at the same time? [07:52:58] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): User[bmansurov] [07:53:15] (03PS1) 10Giuseppe Lavagetto: mwdebug: fix indentation [puppet] - 10https://gerrit.wikimedia.org/r/519963 [07:53:50] 10Operations, 10DBA, 10OTRS, 10Operations-Software-Development, 10Recommendation-API: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (10Marostegui) >>! In T226952#5295201, @jcrespo wrote: > Because of the TTL mention, are you planning a failover of proxy at the same time?... [07:54:10] 10Operations, 10DBA, 10OTRS, 10Operations-Software-Development, 10Recommendation-API: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (10Marostegui) [07:58:00] (03PS1) 10Fsero: ammending patch information [debs/coredns] (debian/sid) - 10https://gerrit.wikimedia.org/r/519965 [07:58:02] (03PS1) 10Fsero: added gbp.conf for initial packaging [debs/coredns] (debian/sid) - 10https://gerrit.wikimedia.org/r/519966 [07:58:28] (03CR) 10Fsero: [V: 03+2 C: 03+2] ammending patch information [debs/coredns] (debian/sid) - 10https://gerrit.wikimedia.org/r/519965 (owner: 10Fsero) [08:00:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: fix indentation [puppet] - 10https://gerrit.wikimedia.org/r/519963 (owner: 10Giuseppe Lavagetto) [08:13:50] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (untested)" [puppet] - 10https://gerrit.wikimedia.org/r/519957 (https://phabricator.wikimedia.org/T225296) (owner: 10Elukey) [08:16:03] (03PS2) 10Elukey: Replace Yarn nodemanager Prometheus exp's host:port to allow IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/519957 
(https://phabricator.wikimedia.org/T225296) [08:16:30] (03CR) 10Elukey: [C: 03+2] Replace Yarn nodemanager Prometheus exp's host:port to allow IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/519957 (https://phabricator.wikimedia.org/T225296) (owner: 10Elukey) [08:17:29] 10Operations, 10DBA, 10OTRS, 10Operations-Software-Development, 10Recommendation-API: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (10jcrespo) I am actually proposing to maybe do it, but it needs more work. [08:20:13] 10Operations, 10DBA, 10OTRS, 10Operations-Software-Development, 10Recommendation-API: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (10Marostegui) Let's leave it aside for now :-) [08:20:34] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/519664 (https://phabricator.wikimedia.org/T184942) (owner: 10Cwhite) [08:24:09] (03CR) 10Daimona Eaytoy: [C: 03+1] Add abusefilter-view-private to checkusers on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519767 (https://phabricator.wikimedia.org/T226899) (owner: 10Urbanecm) [08:29:47] (03CR) 10Muehlenhoff: [C: 03+1] aptrepo: fix the kubeadm packages to include containerd.io [puppet] - 10https://gerrit.wikimedia.org/r/519726 (https://phabricator.wikimedia.org/T215975) (owner: 10Bstorm) [08:31:38] 10Operations, 10DBA, 10OTRS, 10Operations-Software-Development, 10Recommendation-API: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (10Volans) For `debmonitor` it connects to `m2-master.eqiad.wmnet` and I'm not sure if Django's connection pooling would be smart enough to... [08:31:41] (03PS1) 10Marostegui: mariadb: Promote db1132 to m2 master [puppet] - 10https://gerrit.wikimedia.org/r/519975 (https://phabricator.wikimedia.org/T226952) [08:33:48] (03CR) 10Gergő Tisza: [C: 03+1] "As per T217142#5292207, this was tested on Beta and seemed to work well."
[puppet] - 10https://gerrit.wikimedia.org/r/519603 (https://phabricator.wikimedia.org/T217142) (owner: 10Filippo Giunchedi) [08:35:07] 10Operations, 10DBA, 10OTRS, 10Operations-Software-Development, and 2 others: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (10Marostegui) >>! In T226952#5295258, @Volans wrote: > For `debmonitor` it connects to `m2-master.eqiad.wmnet` and I'm not sure if Django's connecti... [08:36:02] 10Operations, 10DBA, 10OTRS, 10Operations-Software-Development, and 2 others: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (10Marostegui) [08:39:36] !log restart hadoop-yarn-nodemanager on all hadoop workers to pick up new jvm settings - T225296 [08:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:44] T225296: High Prometheus TCP retransmits - https://phabricator.wikimedia.org/T225296 [08:41:04] (03CR) 10Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler1001/17167/" [puppet] - 10https://gerrit.wikimedia.org/r/519975 (https://phabricator.wikimedia.org/T226952) (owner: 10Marostegui) [08:41:11] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/519975 (https://phabricator.wikimedia.org/T226952) (owner: 10Marostegui) [08:46:37] 10Operations, 10ops-codfw: Degraded RAID on mw2250 - https://phabricator.wikimedia.org/T226948 (10MoritzMuehlenhoff) a:03Papaul [08:46:57] 10Operations, 10ops-codfw: Degraded RAID on mw2250 - https://phabricator.wikimedia.org/T226948 (10MoritzMuehlenhoff) Warranty expired a month ago, do we have any spare disks of that type around? 
[08:48:33] (03PS2) 10ArielGlenn: svwiki officially 'big', 6 dumps jobs in parallel like the others [puppet] - 10https://gerrit.wikimedia.org/r/518189 (https://phabricator.wikimedia.org/T226200) [08:51:13] (03PS2) 10Fsero: added gbp.conf for initial packaging [debs/coredns] (debian/sid) - 10https://gerrit.wikimedia.org/r/519966 [08:51:14] (03PS1) 10Fsero: adding default Corefile [debs/coredns] (debian/sid) - 10https://gerrit.wikimedia.org/r/519977 [08:51:16] (03PS1) 10Fsero: we dont need vendor commited just as a patch [debs/coredns] (debian/sid) - 10https://gerrit.wikimedia.org/r/519978 [08:51:32] (03CR) 10Fsero: [V: 03+2 C: 03+2] adding default Corefile [debs/coredns] (debian/sid) - 10https://gerrit.wikimedia.org/r/519977 (owner: 10Fsero) [08:51:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [software/conftool] - 10https://gerrit.wikimedia.org/r/519753 (owner: 10Volans) [08:51:49] (03CR) 10Fsero: [V: 03+2 C: 03+2] we dont need vendor commited just as a patch [debs/coredns] (debian/sid) - 10https://gerrit.wikimedia.org/r/519978 (owner: 10Fsero) [08:51:58] (03CR) 10ArielGlenn: [C: 03+2] svwiki officially 'big', 6 dumps jobs in parallel like the others [puppet] - 10https://gerrit.wikimedia.org/r/518189 (https://phabricator.wikimedia.org/T226200) (owner: 10ArielGlenn) [08:52:07] (03CR) 10Fsero: [V: 03+2 C: 03+2] added gbp.conf for initial packaging [debs/coredns] (debian/sid) - 10https://gerrit.wikimedia.org/r/519966 (owner: 10Fsero) [09:05:17] 10Operations: conftool: upgrade fleet to use existing python3-conftool - https://phabricator.wikimedia.org/T226965 (10Volans) [09:08:57] 10Operations, 10serviceops: conftool: upgrade fleet to use existing python3-conftool - https://phabricator.wikimedia.org/T226965 (10Joe) p:05Triage→03Normal a:03Joe [09:09:12] (03PS3) 10Ladsgroup: Do not load InitialiseSettings-labs.php multiple times [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514038 (https://phabricator.wikimedia.org/T224899) (owner: 10WMDE-leszek) [09:09:27]
10Operations, 10DBA, 10OTRS, 10Operations-Software-Development, and 2 others: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (10Marostegui) Note: db2044 needs upgrading [09:09:33] (03CR) 10Ladsgroup: [C: 03+2] "Noop for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514038 (https://phabricator.wikimedia.org/T224899) (owner: 10WMDE-leszek) [09:10:33] (03Merged) 10jenkins-bot: Do not load InitialiseSettings-labs.php multiple times [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514038 (https://phabricator.wikimedia.org/T224899) (owner: 10WMDE-leszek) [09:10:49] (03CR) 10jenkins-bot: Do not load InitialiseSettings-labs.php multiple times [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514038 (https://phabricator.wikimedia.org/T224899) (owner: 10WMDE-leszek) [09:13:12] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:514038|Noop: Do not load InitialiseSettings-labs.php multiple times (T224899)]] (duration: 00m 51s) [09:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:18] T224899: Fatal error Cannot redeclare wmfLabsSettings() on Beta cluster wikis - https://phabricator.wikimedia.org/T224899 [09:14:12] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10fgiunchedi) [09:14:46] 10Operations, 10serviceops: conftool: upgrade fleet to use existing python3-conftool - https://phabricator.wikimedia.org/T226965 (10Joe) What we need to do is: [] Upgrade python3-etcd to the latest version [] Upgrade python3-conftool to the latest version [] Remove python-conftool if present [09:16:42] <_joe_> !log update python3-etcd, python3-conftool to their latest versions T226965 [09:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:47] T226965: conftool: upgrade fleet to use existing python3-conftool - 
https://phabricator.wikimedia.org/T226965 [09:17:02] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10fgiunchedi) [09:17:02] <_joe_> Amir1: ohh thanks <3 [09:17:09] <_joe_> I wanted to do that myself [09:17:40] _joe_: You're welcome, I would have done way sooner if I knew that patch existed [09:19:05] (03PS1) 10Elukey: Disable java.net.preferIPv4Stack on the Hadoop testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/519979 (https://phabricator.wikimedia.org/T225296) [09:19:59] (03CR) 10Elukey: [C: 03+2] Disable java.net.preferIPv4Stack on the Hadoop testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/519979 (https://phabricator.wikimedia.org/T225296) (owner: 10Elukey) [09:27:05] 10Operations, 10ops-esams, 10Traffic: cp3037 is currently unreachable - https://phabricator.wikimedia.org/T222041 (10Joe) Can someone start the decommission process? this host shows up in things like debdeploy runs or cumin runs and that's distracting. [09:29:28] dear ops, since there's no puppet swat today, is it possible to take a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/519495 which changes grafana logo, we did a similar thing with tendril [09:29:38] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (untested)" [puppet] - 10https://gerrit.wikimedia.org/r/519662 (owner: 10Cwhite) [09:30:19] godog: ^ [09:32:20] sure, I'll take a look [09:32:34] Thanks! 
[09:33:10] <_joe_> !log removing python-conftool from all hosts where it's still installed [09:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:15] (03PS2) 10Filippo Giunchedi: grafana: Make the wikimedia logo white [puppet] - 10https://gerrit.wikimedia.org/r/519495 (owner: 10Ladsgroup) [09:39:00] (03PS1) 10Elukey: Bind to IPv6 for Hadoop HDFS daemons on the testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/519980 (https://phabricator.wikimedia.org/T225296) [09:39:09] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: Make the wikimedia logo white [puppet] - 10https://gerrit.wikimedia.org/r/519495 (owner: 10Ladsgroup) [09:40:57] (03PS2) 10Elukey: Bind to IPv6 for Hadoop HDFS daemons on the testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/519980 (https://phabricator.wikimedia.org/T225296) [09:42:17] (03CR) 10Elukey: [C: 03+2] Bind to IPv6 for Hadoop HDFS daemons on the testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/519980 (https://phabricator.wikimedia.org/T225296) (owner: 10Elukey) [09:43:07] (03CR) 10Santhosh: [C: 03+1] Remove Content Translation event logging config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514672 (owner: 10Petar.petkovic) [09:44:45] <_joe_> please no one deploy right now [09:47:46] 10Operations, 10Analytics, 10Analytics-Kanban: Reduce memory allocation for kafkamon instances - https://phabricator.wikimedia.org/T224988 (10akosiaris) >>! In T224988#5295172, @elukey wrote: > I would go down to 4G with (on ganeti1001): > > ` > sudo gnt-instance modify -B memory=4g kafkamon1001.eqiad.wmne... [09:48:20] Amir1: {{done}}, is there a related task? I think we'll need a few changes still [09:48:37] not yet, I can make one right now [09:48:53] sounds great, thank you [09:52:14] PROBLEM - puppet last run on mw1336 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[scap] [09:52:16] 10Operations, 10UI-Standardization: Use white version of Wikimedia logo for grafana - https://phabricator.wikimedia.org/T226970 (10Ladsgroup) [09:52:58] PROBLEM - puppet last run on mw2139 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [09:53:22] 10Operations, 10Analytics, 10Analytics-Kanban, 10vm-requests, 10User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (10akosiaris) >>! In T226844#5295161, @elukey wrote: > Current status is: > > ` > elukey@ganeti1001:~$... [09:53:26] 10Operations, 10UI-Standardization: Use white version of Wikimedia logo for grafana - https://phabricator.wikimedia.org/T226970 (10Ladsgroup) [09:54:28] !log swift eqiad-prod eqiad-prod: put back ms-be1033 - T223518 [09:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:33] T223518: ms-be1033 not powering up - https://phabricator.wikimedia.org/T223518 [09:54:34] !log reboot kafkamon2001 with 4g of dedicated ram (was 8g) - T224988 [09:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:40] T224988: Reduce memory allocation for kafkamon instances - https://phabricator.wikimedia.org/T224988 [09:55:52] !log reboot kafkamon1001 with 4g of dedicated ram (was 8g) - T224988 [09:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:26] PROBLEM - puppet last run on wtp2018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[scap],Package[parsoid/deploy] [09:56:47] 10Operations, 10Wikimedia-Site-requests: Global rename of Waldir → Waldyrious: supervision needed - https://phabricator.wikimedia.org/T225370 (10Marostegui) This doesn't really need a DBA there is no lag replication lag showing up since we replaced all the old hardware [09:56:57] 10Operations, 10Wikimedia-Site-requests: Global rename of Fiona B. → Fiona*: supervision needed - https://phabricator.wikimedia.org/T224348 (10Marostegui) This doesn't really need a DBA there is no lag replication lag showing up since we replaced all the old hardware [09:57:00] PROBLEM - puppet last run on mw2142 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [09:57:25] kafkamon1001's reboot is taking a long time [09:58:00] done :) [09:58:07] 10Operations, 10UI-Standardization: Use white version of Wikimedia logo for grafana - https://phabricator.wikimedia.org/T226970 (10fgiunchedi) Thanks @ladsgroup ! Looks great to me, when hiding the sidebar though (clicking on the logo) the white version with no outlines basically disappears: {F29673151} [09:59:11] 10Operations, 10UI-Standardization: Use white version of Wikimedia logo for grafana - https://phabricator.wikimedia.org/T226970 (10Ladsgroup) >>! In T226970#5295536, @fgiunchedi wrote: > Thanks @ladsgroup ! Looks great to me, when hiding the sidebar though (clicking on the logo) the white version with no outli... [10:01:40] PROBLEM - Check systemd state on kafkamon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:02:09] forcing a puppet run --^ [10:02:58] 10Operations, 10ops-eqiad, 10User-fgiunchedi: ms-be1033 not powering up - https://phabricator.wikimedia.org/T223518 (10fgiunchedi) 05Open→03Resolved The last rebalance is underway now to put ms-be1033 fully back in service. Resolving. [10:03:02] ah nice!
burrow-analytics.service is not needed anymore! [10:04:20] !log remove burrow-analytics.service from kafkamon1001 (the analytics cluster has been decommed) [10:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:33] (03PS2) 10Arturo Borrero Gonzalez: aptrepo: fix the kubeadm packages to include containerd.io [puppet] - 10https://gerrit.wikimedia.org/r/519726 (https://phabricator.wikimedia.org/T215975) (owner: 10Bstorm) [10:04:36] RECOVERY - Check systemd state on kafkamon1001 is OK: OK - running: The system is fully operational [10:10:57] jouncebot, next [10:10:57] In 0 hour(s) and 19 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190701T1030) [10:12:10] I shall update the meta templates [10:12:26] (03PS1) 10Elukey: Introduce an-tool1006 [dns] - 10https://gerrit.wikimedia.org/r/519984 (https://phabricator.wikimedia.org/T226844) [10:12:34] 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10akosiaris) [10:12:48] (03CR) 10jerkins-bot: [V: 04-1] Introduce an-tool1006 [dns] - 10https://gerrit.wikimedia.org/r/519984 (https://phabricator.wikimedia.org/T226844) (owner: 10Elukey) [10:13:30] missing ganeti comment [10:14:26] (03PS2) 10Elukey: Introduce an-tool1006 [dns] - 10https://gerrit.wikimedia.org/r/519984 (https://phabricator.wikimedia.org/T226844) [10:18:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: fix the kubeadm packages to include containerd.io [puppet] - 10https://gerrit.wikimedia.org/r/519726 (https://phabricator.wikimedia.org/T215975) (owner: 10Bstorm) [10:18:51] 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (10Volans) The automatic gathering times out because megacli takes ~3 minutes to return the status of the disks, it blocks at PD7 (the one broken) and takes very long time to get info from that disk. As stated in T... 
[10:18:53] jan_drewniak: not sure if it is too late but I've sync'ed the metawiki Www.* templates, will the actual code be deployed or the past version? [10:19:30] RECOVERY - puppet last run on mw1336 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:20:14] RECOVERY - puppet last run on mw2139 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [10:20:57] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, yes this affects only container registry ATM" [puppet] - 10https://gerrit.wikimedia.org/r/519018 (owner: 10Fsero) [10:23:38] RECOVERY - puppet last run on wtp2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:24:14] RECOVERY - puppet last run on mw2142 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:25:17] 10Operations, 10Icinga, 10Operations-Software-Development, 10observability: ops-monitoring-bot creating dupes - https://phabricator.wikimedia.org/T226908 (10Volans) Yes it's confirmed that the Icinga check flaps between critical and unknown due to time outs and as a result the even handler created the dupe... [10:25:38] PROBLEM - Apache HTTP on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:25:48] 10Operations, 10Icinga, 10Operations-Software-Development, 10observability: ops-monitoring-bot creating dupes - https://phabricator.wikimedia.org/T226908 (10Volans) p:05High→03Normal [10:26:56] RECOVERY - Apache HTTP on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:28:10] PROBLEM - Check systemd state on ms-be1035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:30:04] jan_drewniak: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikimedia Portals Update deploy.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190701T1030). [10:31:41] (03CR) 10Filippo Giunchedi: "> Patch Set 10:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [10:31:48] 10Operations, 10Wikimedia-Site-requests: Global rename of Fiona B. → Fiona*: supervision needed - https://phabricator.wikimedia.org/T224348 (10jbond) p:05Triage→03Normal [10:32:22] 10Operations, 10MediaWiki-extensions-CentralAuth, 10Traffic, 10Performance-Team (Radar), and 2 others: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10jbond) p:05Triage→03Normal [10:32:36] 10Operations, 10serviceops: conftool: upgrade fleet to use existing python3-conftool - https://phabricator.wikimedia.org/T226965 (10Joe) This is blocked until https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/+/491412 is merged and deployed. I'll take care of it during the week.
[10:33:18] 10Operations, 10UI-Standardization: Use white version of Wikimedia logo for grafana - https://phabricator.wikimedia.org/T226970 (10jbond) p:05Triage→03Normal [10:34:22] 10Operations, 10media-storage: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"" - https://phabricator.wikimedia.org/T226937 (10Urbanecm) [10:34:56] 10Operations, 10media-storage: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"" - https://phabricator.wikimedia.org/T226937 (10Urbanecm) [10:35:04] 10Operations, 10media-storage: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"" - https://phabricator.wikimedia.org/T226937 (10jbond) p:05Triage→03Normal [10:37:18] 10Operations, 10Puppet, 10Packaging: facter3: Unable to parse routing table - https://phabricator.wikimedia.org/T222356 (10MoritzMuehlenhoff) The updated package in boron:~jbond/src/facter-3.11.0 looks good. IMO we can ignore to backport this to jessie? the only cp host on jessie is the obsolete cp1008 which... [10:40:03] 10Operations, 10media-storage: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"" - https://phabricator.wikimedia.org/T226937 (10Urbanecm) [10:41:55] 10Operations, 10media-storage: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"" - https://phabricator.wikimedia.org/T226937 (10fgiunchedi) I took a look at this, and it seems we're hitting a curl timeout on the mw side ? I'm not sure how long...
[10:47:53] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10serviceops, and 8 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10Joe) After installing wmerrors on the test servers, these are my results: - **OOM errors** are now correctly treated: we get th... [10:48:16] does the full protection prevent globalrename-initiated page move? [10:48:57] User page is fully protected but the renamer was local sysop on the particular wiki where the page was protected [10:49:24] (I don't want to link the page for few reasons so maybe post the link at -staff or via some PM?) [10:50:45] 10Operations, 10Wikimedia-Site-requests: Global rename of Waldir → Waldyrious: supervision needed - https://phabricator.wikimedia.org/T225370 (101997kB) 05Open→03Resolved a:031997kB Thanks. I have performed this rename and has been completed successfully. [10:51:53] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10serviceops, and 8 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10Joe) [10:55:01] revi, https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/master/includes/LocalRenameJob/LocalPageMoveJob.php#L60 calls MovePage::move(), which, according to https://github.com/wikimedia/mediawiki/blob/master/includes/MovePage.php#L254, "Move[s] a page without taking user permissions into account" [10:55:12] oh kk [10:55:12] so I think it will ignore page protection status [10:56:13] 10Operations, 10Puppet, 10Packaging: facter3: Unable to parse routing table - https://phabricator.wikimedia.org/T222356 (10jbond) > One thing to consider for buster: So far we've used the facter version in buster, so I think we have two options: I would prefer to go for option two until this is actually an... [10:57:29] (03CR) 10Arturo Borrero Gonzalez: "Great!
thanks :-)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [10:58:23] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: install wmerrors everywhere [puppet] - 10https://gerrit.wikimedia.org/r/519986 (https://phabricator.wikimedia.org/T187147) [10:58:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php: install wmerrors everywhere [puppet] - 10https://gerrit.wikimedia.org/r/519986 (https://phabricator.wikimedia.org/T187147) (owner: 10Giuseppe Lavagetto) [10:59:07] (03CR) 10Hashar: "I think they are all fine to go and should be ready for deployment. I don't have any production access to do so though no do I know anythi" [puppet] - 10https://gerrit.wikimedia.org/r/513053 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190701T1100). [11:00:04] hauskatze, Urbanecm, kart_, and apergos: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:12] * apergos is here [11:00:13] I can SWAT today! 
[11:00:14] meow [11:00:47] o/ [11:01:01] +2'ed apergos's backport, to give time for CI [11:01:06] starting with hauskatze's patch [11:01:13] :) [11:01:16] (03PS4) 10Urbanecm: Close wikimania2018.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518715 (https://phabricator.wikimedia.org/T201188) (owner: 10MarcoAurelio) [11:01:18] back in a second [11:01:21] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518715 (https://phabricator.wikimedia.org/T201188) (owner: 10MarcoAurelio) [11:01:24] (thank you) [11:02:18] (03Merged) 10jenkins-bot: Close wikimania2018.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518715 (https://phabricator.wikimedia.org/T201188) (owner: 10MarcoAurelio) [11:02:34] (03CR) 10jenkins-bot: Close wikimania2018.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518715 (https://phabricator.wikimedia.org/T201188) (owner: 10MarcoAurelio) [11:02:37] A bit late. [11:03:17] hi kart_ [11:03:41] Urbanecm: Are we SWAT'ng our selves as usual, right? [11:03:59] In that case, I would like to +2 my wmf.11 patch, if that's fine. [11:04:24] kart_, I've already +2'ed apergos's backport, not sure if +2'ing your backport wouldn't cause any conflicts [11:04:32] hauskatze, your patch is on mwdebug1002 [11:04:55] checking [11:04:58] Urbanecm: oh, let's wait for it then.. [11:05:09] Urbanecm: You can SWAT my patches too :) [11:05:16] will do kart_ [11:05:28] Urbanecm: lgtm [11:05:30] Urbanecm: thanks a lot. Ping me when my turn is there. [11:05:32] hauskatze, deploying [11:05:35] sure kart_ [11:05:37] the edit permission went away as expected [11:06:48] cool! 
[11:06:53] !log urbanecm@deploy1001 Synchronized dblists/: [[:gerrit:518715|Close wikimania2018.wikimedia.org]] (T201188) (duration: 00m 49s) [11:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:59] T201188: Close wikimania2018wiki (July 2019) - https://phabricator.wikimedia.org/T201188 [11:07:01] hauskatze, should be deployed! [11:07:05] :) [11:07:09] final test [11:07:19] (03PS4) 10Urbanecm: Add abusefilter-view-private to checkusers on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519767 (https://phabricator.wikimedia.org/T226899) [11:07:26] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519767 (https://phabricator.wikimedia.org/T226899) (owner: 10Urbanecm) [11:07:32] (03PS1) 10Matthias Geisler: Enable DataBridge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) [11:07:53] all looks good [11:08:01] great! [11:08:35] (03Merged) 10jenkins-bot: Add abusefilter-view-private to checkusers on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519767 (https://phabricator.wikimedia.org/T226899) (owner: 10Urbanecm) [11:08:50] (03CR) 10jenkins-bot: Add abusefilter-view-private to checkusers on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519767 (https://phabricator.wikimedia.org/T226899) (owner: 10Urbanecm) [11:09:08] RECOVERY - Check systemd state on ms-be1035 is OK: OK - running: The system is fully operational [11:10:03] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] Enable DataBridge (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) (owner: 10Matthias Geisler) [11:10:44] !log urbanecm@deploy1001 Synchronized wmf-config/abusefilter.php: [[:gerrit:519767|Add abusefilter-view-private to checkusers on arwiki]] (T226899) (duration: 00m 49s) [11:10:48] kart_, +2'ed your backport, it's not probable they'll end up merged before 
fetching the other one :) [11:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:49] T226899: Add `abusefilter-view-private` permission to checkuser group on arwiki - https://phabricator.wikimedia.org/T226899 [11:12:23] !log upload facter_3.11.0-2~debu9u2+wmf1 to stretch-wikimedia component/facter3 [11:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:28] !log rolling upgrade of facter3 [11:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:34] Urbanecm: Thanks! [11:12:40] yw kart_ [11:12:58] apergos, your patch is merged [11:13:02] \o/ [11:13:59] apergos, on mwdebug1002, in case it's testable there [11:14:34] yes, in this case it is [11:14:44] give me about 2 minutes to test please [11:14:51] apergos, sure, let me know when you're done [11:15:18] (03PS2) 10Matthias Geisler: Enable DataBridge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) [11:15:42] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/php/7.2/fpm/conf.d/20-wmerrors.ini] [11:16:26] (03CR) 10Matthias Geisler: Enable DataBridge (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) (owner: 10Matthias Geisler) [11:16:49] (03PS3) 10Matthias Geisler: Enable DataBridge on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) [11:17:01] kart_, https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/56038/console [11:17:03] this doesn't look good [11:17:28] might be about 3 minutes total to finish up (sorry) [11:17:42] np, I'm still waiting on another backport to merge [11:18:30] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/php/7.2/fpm/conf.d/20-wmerrors.ini] [11:18:41] (03CR) 10Lucas Werkmeister (WMDE): Enable DataBridge on Beta (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) (owner: 10Matthias Geisler) [11:18:46] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/php/7.2/fpm/conf.d/20-wmerrors.ini] [11:19:17] Urbanecm: ouch. Checking. [11:19:20] really? all the snapshots? what's the last puppet patches merged? [11:19:23] thank you kart_ [11:19:36] (03CR) 10Lucas Werkmeister (WMDE): Enable DataBridge on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) (owner: 10Matthias Geisler) [11:20:50] PROBLEM - puppet last run on mwmaint1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/php/7.2/fpm/conf.d/20-wmerrors.ini] [11:21:29] Urbanecm: looks good to me, you can send it on around [11:21:31] 10Operations, 10media-storage: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"" - https://phabricator.wikimedia.org/T226937 (10Reedy) ` reedy@deploy1001:~$ mwscript eval.php commonswiki > var_dump( $wgHTTPTimeout, $wgHTTPImportTimeout, $wgAsy... [11:21:37] apergos, cool, syncing [11:22:04] jbond42: any chance these wmerrors puppet whines might have to do with facter? 
[11:22:25] no its not facter, im looking at it now [11:22:28] ah ty [11:22:52] some module expects /etc/php/7.2/fpm to exist but php-fpm is not installed on all hosts using php [11:23:12] no, hosts that just do command line don't need/want it [11:23:36] yes exactly but the code makes an assumption that everything has it just trying to track the specific code path down [11:23:44] gotcha [11:23:54] (03CR) 10Alexandros Kosiaris: [C: 03+1] Introduce an-tool1006 [dns] - 10https://gerrit.wikimedia.org/r/519984 (https://phabricator.wikimedia.org/T226844) (owner: 10Elukey) [11:24:16] PROBLEM - puppet last run on snapshot1008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/php/7.2/fpm/conf.d/20-wmerrors.ini] [11:24:23] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.11/includes/: SWAT: [[:gerrit:519953|Join slot and content tables when dumping XML]] (T220493) (duration: 01m 14s) [11:24:28] apergos, synced [11:24:35] awesome, thanks! [11:24:41] yw [11:24:52] kart_, do you have an update? [11:24:59] Urbanecm: seems https://integration.wikimedia.org/ci/job/mediawiki-i18n-check-docker/11537/console failing, not sure what's season there with related to Git. [11:25:08] Urbanecm: no. I can't figure it out yet. [11:25:58] ok kart_, let me know then. We have ~30 mins left of the window. [11:26:10] PROBLEM - puppet last run on mwmaint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/php/7.2/fpm/conf.d/20-wmerrors.ini] [11:26:17] Urbanecm: Give me 5 minutes, else I'll give up :) [11:26:24] sure kart_ [11:26:26] urbanecm@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[11:26:27] T220493: Xml stubs dumps are running 5 to 15x slower than previously - https://phabricator.wikimedia.org/T220493 [11:26:46] thanks stashbot, that was sooo helpful [11:27:20] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.11/includes/: SWAT: [[:gerrit:519953|Join slot and content tables when dumping XML]] (T220493) (duration: 01m 14s) [11:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:25] (trying again to see) [11:27:29] well looky there [11:29:33] (03CR) 10CDanis: [C: 03+1] debian: Release 1.1.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/519753 (owner: 10Volans) [11:30:02] PROBLEM - puppet last run on deploy1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/php/7.2/fpm/conf.d/20-wmerrors.ini] [11:30:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] Give scaffold template configuration options for dev purposes (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) (owner: 10Jeena Huneidi) [11:31:14] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:31:16] (03PS5) 10Urbanecm: Clean up wgNamespaceAliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519509 (https://phabricator.wikimedia.org/T226765) (owner: 10DannyS712) [11:31:19] Urbanecm: giving up. Will schedule later. [11:31:24] ok kart_ [11:31:27] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519509 (https://phabricator.wikimedia.org/T226765) (owner: 10DannyS712) [11:31:32] Urbanecm: You can add 'Not done' in calendar. [11:31:35] will do [11:31:48] PROBLEM - puppet last run on snapshot1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/php/7.2/fpm/conf.d/20-wmerrors.ini] [11:31:54] the above is due to [11:31:56] (/Stage[main]/Profile::Mediawiki::Php/Php::Extension[wmerrors]/File[/etc/php/7.2/fpm/conf.d/20-wmerrors.ini]/ensure) change from 'absent' to 'link' failed: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /etc/php/7.2/fpm/conf.d (file: /etc/puppet/modules/php/manifests/extension.pp, line: 42) [11:32:25] (03Merged) 10jenkins-bot: Clean up wgNamespaceAliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519509 (https://phabricator.wikimedia.org/T226765) (owner: 10DannyS712) [11:32:26] I don't see a puppet merge in the last few minutes, jijiki, effie, _joe_ FYI [11:32:41] (03CR) 10jenkins-bot: Clean up wgNamespaceAliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519509 (https://phabricator.wikimedia.org/T226765) (owner: 10DannyS712) [11:34:12] jbond42: did you find anything obvious so far? [11:34:33] I just noticed you were already looking at it before me :) [11:34:36] yes just about to push something one sec [11:34:40] PROBLEM - puppet last run on deploy2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/php/7.2/fpm/conf.d/20-wmerrors.ini] [11:36:14] (03PS1) 10Jbond: mediawiki::php: ensure sapis error is correct for wmerrors: [puppet] - 10https://gerrit.wikimedia.org/r/519990 (https://phabricator.wikimedia.org/T187147) [11:36:16] (03PS4) 10Matthias Geisler: Enable DataBridge on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) [11:36:26] volans: https://gerrit.wikimedia.org/r/519990 [11:36:29] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:519509|Clean up wgNamespaceAliases]] (T226765) (duration: 00m 49s) [11:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:39] T226765: Remove redundant aliases for Project_talk - https://phabricator.wikimedia.org/T226765 [11:36:51] !log EU SWAT done [11:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:25] (03PS5) 10Matthias Geisler: Enable DataBridge on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) [11:37:46] jbond42: if it was explicitely set to fpm only instead of the ['cli', 'fpm'] set at the top of the file there might be a reason [11:38:04] (03CR) 10Matthias Geisler: Enable DataBridge on Beta (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) (owner: 10Matthias Geisler) [11:38:07] so maybe there should be either [] or ['fpm'] and not ['cli'] or ['cli', fpm'] [11:38:15] but I'm missing context to know better [11:38:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "PCC for Toolforge sais this is fine: https://puppet-compiler.wmflabs.org/compiler1001/17168/" [puppet] - 10https://gerrit.wikimedia.org/r/519398 (owner: 10Alexandros Kosiaris) [11:39:01] volans: yes i agree not sure if its oversigth tobegin with or intentional. 
i dont think this is causing massive issues so my take is its safe to wait for serviceops to come back from lunch and give context [11:39:54] if you disagree i can create a patch which just dose ['fpm'] or []. push that now and talk about if this is correct when someone elses is abck? [11:41:07] jbond42: was this caused by a previous merge? [11:41:17] yes one sec [11:41:18] how many hosts affected? [11:41:23] https://gerrit.wikimedia.org/r/c/operations/puppet/+/519986 [11:41:32] how easy is to just revert hte previous one? (looking) [11:41:40] snapshot and mwmaint effected [11:41:44] <_joe_> volans: oh ok yeah I did a stupid error [11:41:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] Give scaffold template configuration options for dev purposes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) (owner: 10Jeena Huneidi) [11:41:52] <_joe_> but please don't revert, it's harmless [11:41:56] ack [11:42:09] <_joe_> I just fixed the sapi without an if guard around it [11:42:11] <_joe_> meh [11:42:19] (03PS3) 10Alexandros Kosiaris: kubernetes: Move k8s::infrastructure_config to profile [puppet] - 10https://gerrit.wikimedia.org/r/519398 [11:42:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] kubernetes: Move k8s::infrastructure_config to profile [puppet] - 10https://gerrit.wikimedia.org/r/519398 (owner: 10Alexandros Kosiaris) [11:42:30] <_joe_> it should only happen on the snapshots and maint hosts [11:42:36] yes thats correct [11:42:43] <_joe_> volans: I'll ack the alerts [11:43:09] there is a patch i did here but not sure if wmerrors is needed on cli sapi https://gerrit.wikimedia.org/r/c/operations/puppet/+/519990 [11:43:29] <_joe_> no it definitely isn't [11:43:37] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: thirdparty/kubeadm-k8s: add cri-tools [puppet] - 10https://gerrit.wikimedia.org/r/519991 (https://phabricator.wikimedia.org/T215531) [11:44:12] <_joe_> jbond42: no the change doesn't solve the issue 
correctly, I'll do it later [11:44:19] ack, ill abandon tnhaty change and leave it to you then :) [11:44:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: thirdparty/kubeadm-k8s: add cri-tools [puppet] - 10https://gerrit.wikimedia.org/r/519991 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [11:44:28] (03PS2) 10Arturo Borrero Gonzalez: aptrepo: thirdparty/kubeadm-k8s: add cri-tools [puppet] - 10https://gerrit.wikimedia.org/r/519991 (https://phabricator.wikimedia.org/T215531) [11:44:42] (03Abandoned) 10Jbond: mediawiki::php: ensure sapis error is correct for wmerrors: [puppet] - 10https://gerrit.wikimedia.org/r/519990 (https://phabricator.wikimedia.org/T187147) (owner: 10Jbond) [11:45:18] <_joe_> acked the alerts, sorry for the noise again :) [11:46:30] no prob [11:46:35] thanks for taking a look [11:47:20] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational [11:56:53] 10Operations, 10Puppet, 10Packaging: facter3: Unable to parse routing table - https://phabricator.wikimedia.org/T222356 (10ema) >>! In T222356#5295683, @MoritzMuehlenhoff wrote: > The updated package in boron:~jbond/src/facter-3.11.0 looks good. IMO we can ignore to backport this to jessie? the only cp host... [12:01:24] !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=sessionstore|citoid|cxserver|eventgate-analytics|eventgate-main|termbox|blubberoid|mathoid|zotero,name=codfw [12:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:53] !log depool codfw for kubernetes upgrades. T226256 [12:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:38] PROBLEM - Check systemd state on ms-be1028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:09:43] (03CR) 10Muehlenhoff: [C: 03+1] Introduce an-tool1006 [dns] - 10https://gerrit.wikimedia.org/r/519984 (https://phabricator.wikimedia.org/T226844) (owner: 10Elukey) [12:10:04] RECOVERY - Check systemd state on ms-be1028 is OK: OK - running: The system is fully operational [12:12:49] (03CR) 10Nikerabbit: [C: 03+1] Configuration migration for Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517933 (https://phabricator.wikimedia.org/T87985) (owner: 10Awight) [12:16:09] 10Operations, 10Puppet, 10Packaging: facter3: Unable to parse routing table - https://phabricator.wikimedia.org/T222356 (10jbond) The update has now been rolled out. [12:24:58] PROBLEM - Device not healthy -SMART- on ms-be1028 is CRITICAL: cluster=swift device=cciss,9 instance=ms-be1028:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1028&var-datasource=eqiad+prometheus/ops [12:25:40] RECOVERY - Check systemd state on kubernetes2001 is OK: OK - running: The system is fully operational [12:27:01] (03PS1) 10Vgutierrez: ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) [12:28:08] (03CR) 10jerkins-bot: [V: 04-1] ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [12:28:52] 10Operations, 10LDAP-Access-Requests: Grant WMDE engineers access to logstash and creating grafana boards / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (10MoritzMuehlenhoff) Merging two replies here: >>! In T225004#5291688, @WMDE-leszek wrote: >> we have two ways to app... 
[12:31:34] (03PS2) 10Vgutierrez: ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) [12:32:37] 10Operations, 10media-storage: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"" - https://phabricator.wikimedia.org/T226937 (10fgiunchedi) Thanks @Reedy ! I'd be interested too in finding out what the timeout is, it is also possible the defau... [12:32:39] (03CR) 10jerkins-bot: [V: 04-1] ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [12:35:58] (03PS3) 10Vgutierrez: ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) [12:37:02] (03CR) 10jerkins-bot: [V: 04-1] ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [12:37:10] not my day [12:39:03] (03PS4) 10Vgutierrez: ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) [12:40:08] (03CR) 10jerkins-bot: [V: 04-1] ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [12:40:20] 10Operations, 10media-storage: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"" - https://phabricator.wikimedia.org/T226937 (10Reedy) I honestly don't know how long it takes to upload a 4GB file or so, I've not paid enough attention. Easy eno... 
[12:41:12] (03PS5) 10Vgutierrez: ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) [12:42:17] (03CR) 10jerkins-bot: [V: 04-1] ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [12:46:34] (03PS6) 10Vgutierrez: ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) [12:49:08] 10Operations, 10observability, 10Performance-Team (Radar), 10User-Elukey: Consider adding per-shard metrics to the prometheus mcrouter exporter - https://phabricator.wikimedia.org/T225059 (10fgiunchedi) >>! In T225059#5295193, @elukey wrote: > @fgiunchedi let me know if the above new metrics (and code if y... [12:49:27] !log repool codfw after kubernetes upgrades. T226256 [12:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:33] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore|citoid|cxserver|eventgate-analytics|eventgate-main|termbox|blubberoid|mathoid|zotero,name=codfw [12:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:04] 10Operations, 10media-storage: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"" - https://phabricator.wikimedia.org/T226937 (10Reedy) Well, this is fun.. :) Closed T226845 as done now ` reedy@mwmaint1002:/tmp/uploads$ date && time mwscript i... [12:51:47] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore|citoid|cxserver|eventgate-analytics|eventgate-main|termbox|blubberoid|mathoid|zotero,name=eqiad [12:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:56] !log depool eqiad for kubernetes upgrades. 
T226256 [12:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:24] RECOVERY - Device not healthy -SMART- on ms-be1028 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1028&var-datasource=eqiad+prometheus/ops [12:55:50] (03CR) 10Fsero: [V: 03+2 C: 03+2] introducing helmfile.d values for staging cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/517887 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [12:57:32] 10Operations, 10Puppet, 10Packaging: facter3: Unable to parse routing table - https://phabricator.wikimedia.org/T222356 (10MoritzMuehlenhoff) >>! In T222356#5295760, @jbond wrote: >> One thing to consider for buster: So far we've used the facter version in buster, so I think we have two options: > > I would... [12:58:34] (03PS7) 10Fsero: k8s, deploy: introducing helmfile for manage charts [puppet] - 10https://gerrit.wikimedia.org/r/517888 (https://phabricator.wikimedia.org/T212130) [12:58:57] query [12:59:01] nope :) [12:59:49] (03CR) 10Elukey: [C: 03+2] Introduce an-tool1006 [dns] - 10https://gerrit.wikimedia.org/r/519984 (https://phabricator.wikimedia.org/T226844) (owner: 10Elukey) [13:00:22] (03CR) 10Fsero: [C: 03+2] k8s, deploy: introducing helmfile for manage charts [puppet] - 10https://gerrit.wikimedia.org/r/517888 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [13:01:17] 10Operations, 10media-storage: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"" - https://phabricator.wikimedia.org/T226937 (10Reedy) Is there any way to check if swift was just "busier" around the times the uploads were done? To see whether... [13:06:56] (03PS1) 10Alexandros Kosiaris: Skip ganeti[12]009 in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/520002 [13:07:34] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 2 failures. 
Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[helmfile],Package[helm-diff] [13:07:52] (03PS5) 10Arturo Borrero Gonzalez: toolforge: k8s: introduce basic config file for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [13:08:16] (03Abandoned) 10Alexandros Kosiaris: Specify keyholder_key in global scap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/380503 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris) [13:13:55] (03PS2) 10Fsero: registry: improving swift replication [puppet] - 10https://gerrit.wikimedia.org/r/519018 [13:14:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] Skip ganeti[12]009 in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/520002 (owner: 10Alexandros Kosiaris) [13:14:22] PROBLEM - High CPU load on API appserver on mw1340 is CRITICAL: CRITICAL - load average: 76.47, 35.80, 22.35 [13:15:19] (03CR) 10Bstorm: toolforge: k8s: introduce basic config file for kubeadm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [13:15:48] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10Volans) Not yet as the script has clearly not been tested: ` $ sudo cookbook sre.ganeti.makevm -h Exception raised while... 
[13:15:52] RECOVERY - High CPU load on API appserver on mw1340 is OK: OK - load average: 24.91, 28.85, 21.10 [13:15:57] (03PS6) 10Arturo Borrero Gonzalez: toolforge: k8s: introduce basic config file for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [13:19:41] (03CR) 10Hoo man: "Alternatively we could use export (but I'm not a fan of that), or even not use a subshell there (but that would mean that we need to gzip " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519494 (https://phabricator.wikimedia.org/T226601) (owner: 10Hoo man) [13:20:13] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore|citoid|cxserver|eventgate-analytics|eventgate-main|termbox|blubberoid|mathoid|zotero,name=eqiad [13:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:21] !log repool eqiad after kubernetes upgrades. T226256 [13:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:26] PROBLEM - Check the last execution of git_pull_charts on contint1001 is CRITICAL: NRPE: Command check_check_git_pull_charts_status not defined [13:26:22] (03PS7) 10Bstorm: toolforge: k8s: introduce basic config file for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) [13:31:02] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: only install wmerrors where fpm is present [puppet] - 10https://gerrit.wikimedia.org/r/520009 [13:32:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php: only install wmerrors where fpm is present [puppet] - 10https://gerrit.wikimedia.org/r/520009 (owner: 10Giuseppe Lavagetto) [13:33:05] (03CR) 10ArielGlenn: [C: 03+1] "Can merge whenever you like. Note that this week's json run is already underway." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519494 (https://phabricator.wikimedia.org/T226601) (owner: 10Hoo man) [13:34:10] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[helmfile],Package[helm-diff] [13:35:06] (03CR) 10Marostegui: "> > Patch Set 10:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [13:35:45] (03PS3) 10Ottomata: Add 'Z' suffix to webrequest log dt format [puppet] - 10https://gerrit.wikimedia.org/r/516528 (https://phabricator.wikimedia.org/T217040) [13:37:19] <_joe_> fsero: contint1001 is a jessie host [13:37:41] <_joe_> same for contint2001 AFAICS [13:37:49] there's a task for upgrading it though [13:37:55] <_joe_> so you probably need to re-compile helmfile there [13:38:01] (03CR) 10Ottomata: [C: 03+2] Add 'Z' suffix to webrequest log dt format [puppet] - 10https://gerrit.wikimedia.org/r/516528 (https://phabricator.wikimedia.org/T217040) (owner: 10Ottomata) [13:38:04] <_joe_> akosiaris: yeah but puppet is failing there [13:38:10] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:38:11] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:22] maybe remove the profile then? 
[13:38:28] RECOVERY - puppet last run on ganeti2009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:38:33] sounds less needless work [13:38:36] !log draining restbase2009 for eventual reboot for MDS kernel updates [13:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:24] RECOVERY - puppet last run on snapshot1008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:40:18] <_joe_> akosiaris: fine by me if it's not needed there [13:40:32] err [13:40:37] is a go application [13:40:47] can i just send the same package to jessie? [13:40:58] ah, yeah we can just use reprepro copy then [13:41:01] even fastert [13:41:06] faster* [13:41:09] (03PS1) 10Elukey: sre.ganeti.makevm: add the possibility to choose link analytics [cookbooks] - 10https://gerrit.wikimedia.org/r/520011 (https://phabricator.wikimedia.org/T203963) [13:41:15] doing [13:41:36] !log uploading helmfile to jessie as well [13:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:56] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:42:33] 10Operations, 10MediaWiki-extensions-CentralAuth, 10Traffic, 10Performance-Team (Radar), and 2 others: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10BBlack) p:05Normal→03High Re-setting this to at le...
[13:42:53] !log rolling update of expat [13:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:09] !log modified dt format of webrequest logs to use 'Z' suffix for timezone offset - T217040 [13:43:10] (03CR) 10jerkins-bot: [V: 04-1] sre.ganeti.makevm: add the possibility to choose link analytics [cookbooks] - 10https://gerrit.wikimedia.org/r/520011 (https://phabricator.wikimedia.org/T203963) (owner: 10Elukey) [13:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:16] T217040: Add UTC 'Z' suffix to webrequest `dt` field. - https://phabricator.wikimedia.org/T217040 [13:44:36] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:44:49] (03CR) 10Jcrespo: "> > Agreed this would simplify the puppet side of things quite a bit!" [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [13:44:52] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:45:09] (03PS1) 10Filippo Giunchedi: rsyslog: use named actions for central syslog hosts [puppet] - 10https://gerrit.wikimedia.org/r/520012 (https://phabricator.wikimedia.org/T226703) [13:45:18] 10Operations, 10observability, 10serviceops, 10PHP 7.2 support, and 2 others: [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm. - https://phabricator.wikimedia.org/T223336 (10Joe) As I explained in T187147#5295715, my understanding is that in case of a segfault php-fpm fa... 
[13:46:54] RECOVERY - puppet last run on mwmaint1002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [13:49:08] 10Operations, 10UI-Standardization: Use white version of Wikimedia logo for grafana - https://phabricator.wikimedia.org/T226970 (10Ladsgroup) So you need to inject this custom css: `lang=css body:not(.sidemenu-open) .sidemenu__logo img, .sidemenu__logo img:hover { object-position: -99999px 99999px; bac... [13:50:10] (03PS2) 10Elukey: sre.ganeti.makevm: add the possibility to choose link analytics [cookbooks] - 10https://gerrit.wikimedia.org/r/520011 (https://phabricator.wikimedia.org/T203963) [13:50:42] 10Operations, 10media-storage: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"" - https://phabricator.wikimedia.org/T226937 (10Urbanecm) Tried to measure how long it takes to upload a 3 GB file. Well, `importImages.php` took ~5.5 mins and end... [13:51:34] (03PS3) 10Elukey: sre.ganeti.makevm: add the possibility to choose link analytics [cookbooks] - 10https://gerrit.wikimedia.org/r/520011 (https://phabricator.wikimedia.org/T203963) [13:52:02] 10Operations, 10media-storage: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"" - https://phabricator.wikimedia.org/T226937 (10Urbanecm) >>! In T226937#5296202, @Reedy wrote: > Is there any way to check if swift was just "busier" around the t... 
[13:52:55] systemd.timers in jessie does not have randomizedDelaySecs options and is making the helmfile change fail in contint [13:53:13] (03CR) 10Bstorm: toolforge: k8s: introduce basic config file for kubeadm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [13:53:24] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:53:25] i'll acknowledge criticals and will upload a patch but given is going to change the systemd timer job template id prefer to do it calmly [13:53:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:33] ^^ _joe_ akosiaris [13:54:24] ACKNOWLEDGEMENT - Check the last execution of git_pull_charts on contint1001 is CRITICAL: NRPE: Command check_check_git_pull_charts_status not defined Fsero Failing due to systemd.timer on jessie does not support RandomizedSecsDelay [13:54:24] ACKNOWLEDGEMENT - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 15 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[helmfile],Package[helm-diff] Fsero Failing due to systemd.timer on jessie does not support RandomizedSecsDelay [13:54:24] ACKNOWLEDGEMENT - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[git_pull_charts.timer] Fsero Failing due to systemd.timer on jessie does not support RandomizedSecsDelay [13:54:40] !log draining restbase2010 for eventual reboot for MDS kernel updates [13:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:32] fsero: sigh, ok [13:56:52] RECOVERY - puppet last run on mwmaint2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:57:03] (03CR) 10Herron: [C: 03+1] rsyslog: use named actions for central syslog hosts [puppet] - 10https://gerrit.wikimedia.org/r/520012 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [14:01:00] (03CR) 10Herron: [C: 03+1] "LGTM overall! One minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519603 (https://phabricator.wikimedia.org/T217142) (owner: 10Filippo Giunchedi) [14:02:02] (03PS3) 10Herron: Revert "kafka-main2001: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/519476 [14:02:22] RECOVERY - puppet last run on snapshot1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:02:28] !log draining restbase2011 for eventual reboot for MDS kernel updates [14:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:05] (03PS1) 10Paladox: Gerrit: Wait 5s before starting [puppet] - 10https://gerrit.wikimedia.org/r/520016 [14:03:30] (03PS2) 10Paladox: Gerrit: Wait 5s before starting [puppet] - 10https://gerrit.wikimedia.org/r/520016 [14:04:10] (03PS1) 10Ladsgroup: Revert "grafana: Make the wikimedia logo white" [puppet] - 10https://gerrit.wikimedia.org/r/520017 [14:04:13] (03CR) 10Marostegui: "This will be a great simplification and make us less error prone when provisioning/moving hosts :-)" [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [14:04:32] RECOVERY - Check the last execution of 
git_pull_charts on contint1001 is OK: OK: Status of the systemd unit git_pull_charts [14:05:04] PROBLEM - Check systemd state on ms-be1034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:05:14] (03PS1) 10Fsero: bug: systemd.timers does not support random on jessie [puppet] - 10https://gerrit.wikimedia.org/r/520018 [14:06:02] (03CR) 10jerkins-bot: [V: 04-1] bug: systemd.timers does not support random on jessie [puppet] - 10https://gerrit.wikimedia.org/r/520018 (owner: 10Fsero) [14:07:14] (03PS1) 10Ottomata: Produce centralnotice.campaign-* streams to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520019 (https://phabricator.wikimedia.org/T211248) [14:07:31] (03CR) 10Herron: [C: 03+2] Revert "kafka-main2001: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/519476 (owner: 10Herron) [14:08:27] (03CR) 10Bstorm: toolforge: k8s: introduce basic config file for kubeadm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [14:08:58] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:09:01] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:20] !log rolling reboot of docker registry nodes to pick up MDS-enabled qemu [14:10:23] (03PS1) 10Ema: cache: reimage cp2017 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520021 (https://phabricator.wikimedia.org/T226637) [14:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:52] (03CR) 10Thcipriani: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/520016 (owner: 10Paladox) [14:16:05] (03CR) 10Vgutierrez: [C: 03+1] cache: reimage cp2017 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520021 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema) [14:17:33] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (10elukey) ping :) [14:22:34] Amir1: no joy with the grafana logo outline heh? [14:23:01] no :(( I have been wrestling with it for hours now [14:23:52] Amir1: my two cents without knowing anything more, what about a version of the svg that's outlined? [14:23:57] s/more// [14:24:03] !log draining restbase2012 for eventual reboot for MDS kernel updates [14:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:36] !log depool cp2017 and reimage as upload_ats T226637 [14:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:42] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [14:26:11] godog: I can give it a try [14:26:18] it should not be super hard [14:26:23] (03CR) 10Ema: [C: 03+2] cache: reimage cp2017 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520021 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema) [14:27:01] I _think_ it should work, both when hovering the logo and when the sidebar is hidden [14:27:55] yeah, the big problem is different backgrounds and different sizes [14:28:26] so for example, if I make the outline color black, it would look weird in login page [14:28:27] https://grafana.wikimedia.org/login?redirect=%2F [14:30:20] (03CR) 10Filippo Giunchedi: logstash: add consumer for client errors (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/519603 (https://phabricator.wikimedia.org/T217142) (owner: 10Filippo Giunchedi) [14:30:26] 10Operations, 
10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2017.codfw.wmnet'] ` The log can be found in `... [14:31:21] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (10elukey) @Nuria, do you think that we could work on this during the next couple of months? Seems to be an easy enough change to be ready in... [14:31:25] Amir1: ah yeah fair enough, ok I'll go ahead with your revert for now [14:31:47] Thanks. Sorry :( [14:31:58] (03CR) 10Jcrespo: "> Patch Set 10:" [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [14:32:28] Amir1: np! thanks for working on it [14:32:52] (03PS2) 10Filippo Giunchedi: Revert "grafana: Make the wikimedia logo white" [puppet] - 10https://gerrit.wikimedia.org/r/520017 (owner: 10Ladsgroup) [14:32:59] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "grafana: Make the wikimedia logo white" [puppet] - 10https://gerrit.wikimedia.org/r/520017 (owner: 10Ladsgroup) [14:36:52] (03PS4) 10Filippo Giunchedi: logstash: add consumer for client errors [puppet] - 10https://gerrit.wikimedia.org/r/519603 (https://phabricator.wikimedia.org/T217142) [14:38:12] 10Operations, 10observability, 10serviceops, 10PHP 7.2 support, and 2 others: [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm. - https://phabricator.wikimedia.org/T223336 (10Joe) Using a modified version of `furl` that now supports unix sockets, for segfaults I get: `lan... 
[14:40:56] RECOVERY - Check systemd state on ms-be1034 is OK: OK - running: The system is fully operational [14:41:53] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:41:55] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:19] !log draining restbase2013 for eventual reboot for MDS kernel updates [14:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:57] (03CR) 10Ottomata: "Oh! Is there a logging kafka beta cluster? Will do what are the brokers?" [puppet] - 10https://gerrit.wikimedia.org/r/519603 (https://phabricator.wikimedia.org/T217142) (owner: 10Filippo Giunchedi) [14:48:53] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:49:03] (03CR) 10Volans: [C: 03+2] "LGTM, thanks for the fix!" [cookbooks] - 10https://gerrit.wikimedia.org/r/520011 (https://phabricator.wikimedia.org/T203963) (owner: 10Elukey) [14:49:53] (03PS1) 10Giuseppe Lavagetto: furl: support connecting to unix sockets [puppet] - 10https://gerrit.wikimedia.org/r/520025 [14:49:59] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:50:16] (03CR) 10Filippo Giunchedi: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/519603 (https://phabricator.wikimedia.org/T217142) (owner: 10Filippo Giunchedi) [14:50:35] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (10Nuria) Let's plan this for this quarter then? (q2?) 
[14:50:43] (03Merged) 10jenkins-bot: sre.ganeti.makevm: add the possibility to choose link analytics [cookbooks] - 10https://gerrit.wikimedia.org/r/520011 (https://phabricator.wikimedia.org/T203963) (owner: 10Elukey) [14:50:49] !log installing openjdk-8 security updates on stretch-based restbase hosts [14:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:11] (03PS2) 10Fdans: ReportUpdater: change repo of all queries to reportupdater-queries [puppet] - 10https://gerrit.wikimedia.org/r/517085 (https://phabricator.wikimedia.org/T222739) [14:52:16] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/17173/" [puppet] - 10https://gerrit.wikimedia.org/r/520012 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [14:52:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] bug: systemd.timers does not support random on jessie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520018 (owner: 10Fsero) [14:54:42] !log draining restbase2014 for eventual reboot for MDS kernel updates [14:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:07] PROBLEM - HHVM jobrunner on mw2280 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [14:59:14] (03CR) 10Krinkle: furl: support connecting to unix sockets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520025 (owner: 10Giuseppe Lavagetto) [14:59:26] 10Operations, 10Diffusion, 10Packaging, 10Patch-For-Review, and 4 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10jbond) @MoritzMuehlenhoff The updated package has broken some dependencies which is causing an error on... 
[15:00:19] RECOVERY - HHVM jobrunner on mw2280 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.080 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [15:01:10] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10elukey) Almost! ` usage: cookbook [-h] [--vcpus VCPUS] [--memory MEMORY] [--disk DISK] [--link {public,... [15:01:32] (03PS2) 10Filippo Giunchedi: rsyslog: use named actions for central syslog hosts [puppet] - 10https://gerrit.wikimedia.org/r/520012 (https://phabricator.wikimedia.org/T226703) [15:01:46] ping ottomata standupo [15:02:13] (03PS7) 10Vgutierrez: ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) [15:03:16] 10Operations, 10Diffusion, 10Packaging, 10Patch-For-Review, and 4 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10MoritzMuehlenhoff) openssh-sftp-server also built from src:openssh, I've also installed the SFTP packag... 
[15:03:30] (03PS2) 10Fsero: bug: systemd.timers does not support random on jessie [puppet] - 10https://gerrit.wikimedia.org/r/520018 [15:03:57] (03CR) 10Cwhite: [C: 03+2] grafana: fix and update to grafana-dashboard script [puppet] - 10https://gerrit.wikimedia.org/r/519662 (owner: 10Cwhite) [15:04:05] (03PS3) 10Cwhite: grafana: fix and update to grafana-dashboard script [puppet] - 10https://gerrit.wikimedia.org/r/519662 [15:04:18] (03CR) 10jerkins-bot: [V: 04-1] bug: systemd.timers does not support random on jessie [puppet] - 10https://gerrit.wikimedia.org/r/520018 (owner: 10Fsero) [15:04:32] 10Operations, 10Diffusion, 10Packaging, 10Patch-For-Review, and 4 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10mmodell) @moritzmuehlenhoff: cool! FWIW the new package seems to have fixed the problem and I haven't n... [15:04:45] (03PS1) 10Elukey: sre.ganeti.makevm: move split away from argparse [cookbooks] - 10https://gerrit.wikimedia.org/r/520028 (https://phabricator.wikimedia.org/T203963) [15:05:41] (03PS8) 10Vgutierrez: ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) [15:11:47] !log draining restbase2015 for eventual reboot for MDS kernel updates [15:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:05] RECOVERY - puppet last run on phab1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:14:23] <_joe_> fsero: akosiaris gave you the wrong suggestion :P [15:14:44] do you have the right one? 
[15:15:12] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:15:14] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:26] (03PS8) 10Bstorm: toolforge: k8s: introduce basic config file for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) [15:15:48] <_joe_> one sec, maybe I remember incorrectly [15:16:50] I did ? [15:16:54] <_joe_> yeah your problem is we need to set up lsb facts correctly in tests [15:17:29] <_joe_> akosiaris: nope you didn't, I saw the -1 from CI and assumed you told him to use scope.call_function [15:18:12] I did [15:19:00] fsero: I wrapped my head around incubator/raw btw. Nice! [15:19:05] !log draining restbase2016 for eventual reboot for MDS kernel updates [15:19:07] (03PS9) 10Arturo Borrero Gonzalez: toolforge: k8s: introduce basic config file for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:09] (03CR) 10Bstorm: [C: 03+2] toolforge: k8s: introduce basic config file for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:19:12] good idea using it [15:19:19] <_joe_> no, you gave him the correct suggestion :) [15:19:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge: k8s: introduce basic config file for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:19:56] so _joe_ akosiaris the problem is the test docker image does not have lsb_release package? 
[15:20:11] <_joe_> fsero: no the problem is the spec tests lack it [15:20:16] akosiaris: ty now we should use it for real, that means recreate the staging cluster [15:20:29] (03CR) 10Ladsgroup: [C: 03+1] dumpwikidatajson: Fix error code detection [puppet] - 10https://gerrit.wikimedia.org/r/519494 (https://phabricator.wikimedia.org/T226601) (owner: 10Hoo man) [15:20:31] (03CR) 10Bstorm: [C: 03+2] toolforge: k8s: introduce basic config file for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:20:35] (03CR) 10Ladsgroup: [C: 03+1] Wikidata dumps: Update minimum expected sizes [puppet] - 10https://gerrit.wikimedia.org/r/519493 (https://phabricator.wikimedia.org/T226601) (owner: 10Hoo man) [15:20:53] fsero: yup, agreed [15:21:38] <_joe_> fsero: ok I found the issue, I'll fix it and you can rebase on top of my change [15:22:15] (03PS9) 10Vgutierrez: ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) [15:23:03] (03CR) 10jerkins-bot: [V: 04-1] ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [15:24:39] (03CR) 10Volans: [C: 03+2] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/520028 (https://phabricator.wikimedia.org/T203963) (owner: 10Elukey) [15:24:56] thx _joe_ ! [15:25:22] (03CR) 10Cwhite: [C: 03+2] grafana: update varnish-aggregate-client-status-codes to prometheus version [puppet] - 10https://gerrit.wikimedia.org/r/519664 (https://phabricator.wikimedia.org/T184942) (owner: 10Cwhite) [15:25:24] (03CR) 10Ottomata: [C: 03+1] "Great! Let me know when you are ready to merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/517085 (https://phabricator.wikimedia.org/T222739) (owner: 10Fdans) [15:25:31] (03PS2) 10Cwhite: grafana: update varnish-aggregate-client-status-codes to prometheus version [puppet] - 10https://gerrit.wikimedia.org/r/519664 (https://phabricator.wikimedia.org/T184942) [15:25:36] 10Operations, 10LDAP-Access-Requests: Grant WMDE engineers access to logstash and creating grafana boards / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (10WMDE-leszek) Re being in WMDE group being equivalent to being in NDA group, just for the record: there are/were memb... [15:26:17] (03Merged) 10jenkins-bot: sre.ganeti.makevm: move split away from argparse [cookbooks] - 10https://gerrit.wikimedia.org/r/520028 (https://phabricator.wikimedia.org/T203963) (owner: 10Elukey) [15:27:19] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [15:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:37] (03PS1) 10Jbond: admin: update aacraze ssh key [puppet] - 10https://gerrit.wikimedia.org/r/520032 [15:27:54] chaomodus ^^ we got a tester ;) [15:27:54] !log elukey@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [15:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:49] (03PS2) 10Jbond: admin: update aacraze ssh key [puppet] - 10https://gerrit.wikimedia.org/r/520032 [15:29:49] (03PS1) 10Elukey: sre.ganet.makevm: add info about chosen link before creation [cookbooks] - 10https://gerrit.wikimedia.org/r/520033 (https://phabricator.wikimedia.org/T203963) [15:29:57] PROBLEM - Check systemd state on ms-be1034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:30:01] volans: don't hate me --^ :D [15:30:06] this one is a nitpick [15:30:44] (03CR) 10Jbond: [C: 03+2] admin: update aacraze ssh key [puppet] - 10https://gerrit.wikimedia.org/r/520032 (owner: 10Jbond) [15:30:44] what failed? 
[15:30:59] (03PS2) 10Giuseppe Lavagetto: furl: support connecting to unix sockets [puppet] - 10https://gerrit.wikimedia.org/r/520025 [15:31:01] I make it fail, see the code change [15:31:01] (03PS1) 10Giuseppe Lavagetto: systemd::timer: use OS facts in tests [puppet] - 10https://gerrit.wikimedia.org/r/520034 [15:31:09] !log draining restbase2017 for eventual reboot for MDS kernel updates [15:31:10] I wanted to run it with a complete recap [15:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:28] ah you stopped it when it asked for confirmation? [15:31:30] (I entered "no" three times in a row when asked to proceed) [15:31:33] got it :) [15:31:44] (03CR) 10Volans: [C: 03+2] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/520033 (https://phabricator.wikimedia.org/T203963) (owner: 10Elukey) [15:32:06] (03PS3) 10Filippo Giunchedi: rsyslog: use named actions for central syslog hosts [puppet] - 10https://gerrit.wikimedia.org/r/520012 (https://phabricator.wikimedia.org/T226703) [15:33:37] (03Merged) 10jenkins-bot: sre.ganet.makevm: add info about chosen link before creation [cookbooks] - 10https://gerrit.wikimedia.org/r/520033 (https://phabricator.wikimedia.org/T203963) (owner: 10Elukey) [15:35:05] RECOVERY - Check systemd state on ms-be1034 is OK: OK - running: The system is fully operational [15:36:51] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [15:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:04] !log elukey@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [15:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:11] (03CR) 10Filippo Giunchedi: "latest PCC https://puppet-compiler.wmflabs.org/compiler1002/17175/" [puppet] - 10https://gerrit.wikimedia.org/r/520012 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [15:38:14] (03CR) 10Kosta Harlan: [C: 03+1] Setup EditorJourney for Arabic Wikipedia 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/519581 (https://phabricator.wikimedia.org/T225737) (owner: 10Urbanecm) [15:39:43] (03CR) 10Kosta Harlan: [C: 03+1] "Registered for morning SWAT today. Thanks Urbanecm!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519581 (https://phabricator.wikimedia.org/T225737) (owner: 10Urbanecm) [15:40:17] 10Operations, 10Analytics, 10Analytics-Kanban, 10vm-requests, 10User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (10fdans) p:05Triage→03High [15:42:01] !log draining restbase2018 for eventual reboot for MDS kernel updates [15:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:20] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2017.codfw.wmnet'] ` and were **ALL** successful. [15:42:27] (03PS10) 10Vgutierrez: ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) [15:43:24] (03CR) 10jerkins-bot: [V: 04-1] ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [15:45:01] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10elukey) From an old task to create an-tool1005 (https://phabricator.wikimedia.org/T217738): ` sudo gnt-instance add -t...
[15:46:58] 10Operations, 10Analytics, 10Cleanup: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10fdans) p:05Triageβ†’03Normal [15:47:10] 10Operations, 10Analytics, 10Analytics-Kanban, 10Cleanup: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10fdans) [15:47:14] (03PS11) 10Vgutierrez: ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) [15:47:30] 10Operations, 10Analytics, 10Patch-For-Review, 10Security, and 2 others: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Ottomata) [15:47:34] chaomodus: when you've time could you have a look with elukey at what went wrong above? (ganeti makevm) [15:47:47] yep [15:48:47] 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10akosiaris) [15:49:29] thanks! :) [15:49:52] I don't recall if we show the cumin's output to the user or log it [15:50:18] 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10akosiaris) [15:50:29] 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10akosiaris) 05Openβ†’03Resolved All hosts are installed. They will be added to the clusters in a different task. @papaul, thanks! [15:51:53] volans: I found some logs but none about what the command emitted on ganeti1001.. but I might have checked in the wrong places [15:52:05] what I want to do is execute the command on ganeti1001 and see what emits [15:52:19] or do you guys prefer differntly? [15:52:58] volans, chaomodus --^ [15:53:11] it should have executed it on the current ganeti master node shouldn't it have? [15:53:46] yep yep, but in /var/log/spicerack/sre/ganeti/etc.. 
I don't find anything returned by cumin about why it failed [15:54:11] oic [15:54:22] yah give it a go on the target host (but I think it's ganeti1003 not 1001) [15:55:53] chaomodus: it is 1001 afaics [15:56:22] it's dynamic, but as soon as you run a command it tells you which one it's the real one ;) [15:56:31] or you can write a 3 line cookbook that tells you that :-P [15:56:36] you are correct elukey [15:56:47] yes i understand thhat it's dynamic, it was 1003 for a long time :) [15:59:31] i'm glad it's correctly selecting the master node in the cookbook in any case :) [16:01:06] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10MoritzMuehlenhoff) @fgiunchedi Shall we close this task? the current jessie package is rolled out and it's not very likely we'll upgrade j... [16:01:10] !log pool cp2017 w/ ATS backend T226637 [16:01:13] lol :) [16:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:15] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [16:05:42] (03PS2) 10Jcrespo: WMFReplication: Make move work for a limited number of cases [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/517794 [16:05:49] (03CR) 10jerkins-bot: [V: 04-1] WMFReplication: Make move work for a limited number of cases [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/517794 (owner: 10Jcrespo) [16:16:31] 10Operations, 10serviceops, 10Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10MoritzMuehlenhoff) >>! In T226236#5291604, @hashar wrote: > Ah eventually I found the entry: > > ` > Name: thirdparty/k... 
[16:16:33] (03PS1) 10Cwhite: grafana: remove uid field from dashboard [puppet] - 10https://gerrit.wikimedia.org/r/520039 [16:17:25] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10MoritzMuehlenhoff) a:03Gehel [16:19:29] (03PS1) 10Cwhite: grafana: update script to remove uid field from dashboard [puppet] - 10https://gerrit.wikimedia.org/r/520040 [16:20:39] (03CR) 10Jbond: "updated PCC now that labs/private has been updated" [puppet] - 10https://gerrit.wikimedia.org/r/511686 (owner: 10Jbond) [16:27:22] (03CR) 10CDanis: "I'm not sure that removing UID is the way to go here." [puppet] - 10https://gerrit.wikimedia.org/r/520039 (owner: 10Cwhite) [16:34:10] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable DataBridge on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) (owner: 10Matthias Geisler) [16:42:03] 10Operations, 10observability, 10serviceops: Gather metrics on request status codes, latencies from the MediaWiki appservers - https://phabricator.wikimedia.org/T226815 (10colewhite) @Joe afaik, we're using mtail for this kind of metrics gathering. Some examples are the [[ https://github.com/wikimedia/puppe... 
[16:48:17] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: handle common kubeadm preflight checks [puppet] - 10https://gerrit.wikimedia.org/r/520043 (https://phabricator.wikimedia.org/T215531) [16:49:58] (03PS1) 10Elukey: sre.ganeti.makevm: fix create vm command [cookbooks] - 10https://gerrit.wikimedia.org/r/520044 (https://phabricator.wikimedia.org/T203963) [16:50:20] (03CR) 10Bstorm: [C: 03+1] "Looks good...and we'll need to rebuild the VM :-D" [puppet] - 10https://gerrit.wikimedia.org/r/520043 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [16:50:36] (03PS3) 10Alexandros Kosiaris: phabricator: Allow admins to silence maniphest bulk jobs via sudo [puppet] - 10https://gerrit.wikimedia.org/r/517140 (owner: 1020after4) [16:50:43] (03CR) 10Alexandros Kosiaris: [C: 03+2] phabricator: Allow admins to silence maniphest bulk jobs via sudo [puppet] - 10https://gerrit.wikimedia.org/r/517140 (owner: 1020after4) [16:50:46] (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: handle common kubeadm preflight checks [puppet] - 10https://gerrit.wikimedia.org/r/520043 (https://phabricator.wikimedia.org/T215531) [16:51:26] chaomodus: --^ [16:52:15] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) [16:53:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: handle common kubeadm preflight checks [puppet] - 10https://gerrit.wikimedia.org/r/520043 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [16:54:22] (03PS1) 10Bstorm: toolforge: fix the kubeadm config a bit for the master based on testing [puppet] - 10https://gerrit.wikimedia.org/r/520045 [16:54:40] 10Operations, 10ops-codfw: Degraded RAID on mw2250 - https://phabricator.wikimedia.org/T226948 (10Papaul) p:05Triage→03Normal [16:54:49] (03CR) 10jerkins-bot: [V: 04-1] toolforge: fix the kubeadm 
config a bit for the master based on testing [puppet] - 10https://gerrit.wikimedia.org/r/520045 (owner: 10Bstorm) [16:57:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM +1" [puppet] - 10https://gerrit.wikimedia.org/r/520045 (owner: 10Bstorm) [16:59:35] (03PS4) 10Alexandros Kosiaris: phabricator: Allow admins to silence maniphest bulk jobs via sudo [puppet] - 10https://gerrit.wikimedia.org/r/517140 (owner: 1020after4) [16:59:40] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] phabricator: Allow admins to silence maniphest bulk jobs via sudo [puppet] - 10https://gerrit.wikimedia.org/r/517140 (owner: 1020after4) [17:00:05] gehel and onimisionipe: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190701T1700). [17:00:15] jouncebot: here I am [17:02:05] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10elukey) @akosiaris I know that today I asked you 1000 questions about ganeti, but if you could review the diff between d... [17:04:28] (03PS2) 10Arturo Borrero Gonzalez: toolforge: fix the kubeadm config a bit for the master based on testing [puppet] - 10https://gerrit.wikimedia.org/r/520045 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [17:06:50] (03PS3) 10Arturo Borrero Gonzalez: toolforge: fix the kubeadm config a bit for the master based on testing [puppet] - 10https://gerrit.wikimedia.org/r/520045 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [17:07:39] looks like no deploy package is ready for WDQS, might be rescheduled slightly later or canceled completely [17:08:02] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10akosiaris) >>! 
In T203963#5297177, @elukey wrote: > @akosiaris I know that today I asked you 1000 questions about ganeti... [17:08:24] (03CR) 10Bstorm: [C: 03+2] "Thanks for fixing the commit msg :)" [puppet] - 10https://gerrit.wikimedia.org/r/520045 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [17:22:08] (03CR) 10Elukey: [C: 03+2] "Based on What Alex wrote in the task, I think it is safe to merge" [cookbooks] - 10https://gerrit.wikimedia.org/r/520044 (https://phabricator.wikimedia.org/T203963) (owner: 10Elukey) [17:22:57] (03CR) 10CRusnov: [C: 03+1] "Looks good to me, guess it was a misread / typo from the original script." [cookbooks] - 10https://gerrit.wikimedia.org/r/520044 (https://phabricator.wikimedia.org/T203963) (owner: 10Elukey) [17:23:10] jouncebot, next [17:23:10] In 0 hour(s) and 36 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190701T1800) [17:23:42] (03Merged) 10jenkins-bot: sre.ganeti.makevm: fix create vm command [cookbooks] - 10https://gerrit.wikimedia.org/r/520044 (https://phabricator.wikimedia.org/T203963) (owner: 10Elukey) [17:27:23] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [17:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:51] (03PS2) 10Thcipriani: blubberoid: Add policy file [deployment-charts] - 10https://gerrit.wikimedia.org/r/517573 (https://phabricator.wikimedia.org/T215319) [17:30:59] (03CR) 10Thcipriani: blubberoid: Add policy file (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/517573 (https://phabricator.wikimedia.org/T215319) (owner: 10Thcipriani) [17:46:12] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:46:35] (03PS22) 10Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 [17:46:56] that's refinery-sqoop-whole-mediawiki.service elukey, ottomata [17:46:59] ^^^ [17:48:02] subprocess.CalledProcessError: Command '['sqoop', 'import'.... returned non-zero exit status 1 [17:48:24] (03PS23) 10Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 [17:48:28] ah lovely, we'll get an alarm soon in analytics then [17:48:28] a bunch of those stacktraces in the logs [17:48:48] it is our monthly job to pull data for mediawiki wikis [17:49:05] there is a blob saying jobs to re-run, are all hiwikisource related [17:51:36] yep yep [17:51:41] my team is going to take care of it [17:52:53] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10elukey) After launching the cookbook, it got stuck: ` elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm eqiad_A an-to... [17:53:02] thanks! [17:55:02] (03CR) 10Ottomata: "Ok, done. You'll need to change the kafka cluster for the logstash importer then." [puppet] - 10https://gerrit.wikimedia.org/r/519603 (https://phabricator.wikimedia.org/T217142) (owner: 10Filippo Giunchedi) [18:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190701T1800). [18:00:05] RoanKattouw and kostajh: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:14] here [18:02:19] here [18:02:28] Will do SWAT in ~15 if no one else picks it up [18:05:09] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10elukey) Nevermind, found the following in /var/log/ganeti on ganeti1001: ` 2019-07-01 18:03:52,655: job-646757 pid=7552... [18:08:59] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10crusnov) Interesting, it sure does take a while for the disk to build, and the tool will wait. [18:09:17] OK I'm back, I'll do the SWAT [18:09:40] (03PS2) 10Catrope: Setup EditorJourney for Arabic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519581 (https://phabricator.wikimedia.org/T225737) (owner: 10Urbanecm) [18:09:49] (03CR) 10Catrope: [C: 03+2] Setup EditorJourney for Arabic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519581 (https://phabricator.wikimedia.org/T225737) (owner: 10Urbanecm) [18:10:51] (03Merged) 10jenkins-bot: Setup EditorJourney for Arabic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519581 (https://phabricator.wikimedia.org/T225737) (owner: 10Urbanecm) [18:11:11] (03CR) 10jenkins-bot: Setup EditorJourney for Arabic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519581 (https://phabricator.wikimedia.org/T225737) (owner: 10Urbanecm) [18:13:19] kostajh: Should I put the arwikiEditorJourney patch on mwdebug1002 first, or should I deploy it right away? Is there any meaningful testing you could do on mwdebug1002? 
[18:14:06] RoanKattouw: I can create an account and verify the events [18:14:39] kostajh: OK, it's on there, so go ahead [18:14:45] RoanKattouw: doing [18:19:33] RoanKattouw, if you have time after your part of SWAT is done, please let me know, would like to deploy something. [18:19:44] Sure [18:20:31] thanks :) [18:21:26] RoanKattouw: it looks good to me and Nettrom confirmed in #wikimedia-analytics as well [18:21:35] RoanKattouw, btw, you probably have incorrect task number in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/519080 ;) [18:21:37] Cool, deploying sitewide [18:21:58] Urbanecm: Unfortunately I think I do have the correct number, in that there isn't a better-suited task [18:23:02] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable EditorJourney on arwiki (T225737) (duration: 00m 49s) [18:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:09] T225737: Setup EditorJourney for Arabic Wikipedia - https://phabricator.wikimedia.org/T225737 [18:23:32] okay RoanKattouw [18:24:02] (03PS4) 10Catrope: Enable GrowthExperiments homepage on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519080 (https://phabricator.wikimedia.org/T218237) [18:24:55] (03CR) 10Catrope: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519080 (https://phabricator.wikimedia.org/T218237) (owner: 10Catrope) [18:24:59] (03CR) 10Catrope: [C: 03+2] Enable GrowthExperiments homepage on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519080 (https://phabricator.wikimedia.org/T218237) (owner: 10Catrope) [18:26:00] (03Merged) 10jenkins-bot: Enable GrowthExperiments homepage on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519080 (https://phabricator.wikimedia.org/T218237) (owner: 10Catrope) [18:26:42] (03CR) 10jenkins-bot: Enable GrowthExperiments homepage on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519080 
(https://phabricator.wikimedia.org/T218237) (owner: 10Catrope) [18:29:22] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable Special:Homepage on viwiki (T218237) (duration: 00m 49s) [18:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:27] T218237: Homepage: setup mentors signup wiki pages - https://phabricator.wikimedia.org/T218237 [18:29:40] (03PS2) 10Cwhite: grafana: update script to manage certain fields [puppet] - 10https://gerrit.wikimedia.org/r/520040 [18:31:21] Urbanecm: Go ahead and deploy your thing, I'm waiting for confirmation from Marshall/Morten before deploying my last patch [18:31:30] ok, thanks RoanKattouw [18:32:00] waiting for CI (ContentTranslation, will take a while) [18:32:13] (03PS1) 10Bstorm: toolsforge: fix up the kubeadm config a bit more and configure docker [puppet] - 10https://gerrit.wikimedia.org/r/520052 (https://phabricator.wikimedia.org/T215531) [18:35:09] (03PS2) 10Cwhite: grafana: remove id field from dashboard and add cooresponding uid [puppet] - 10https://gerrit.wikimedia.org/r/520039 [18:35:35] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [18:35:37] (03CR) 10Cwhite: "updated to incorporate offline discussion items with @CDanis" [puppet] - 10https://gerrit.wikimedia.org/r/520039 (owner: 10Cwhite) [18:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:29] (03CR) 10Cwhite: "updated to incorporate feedback from @cdanis" [puppet] - 10https://gerrit.wikimedia.org/r/520040 (owner: 10Cwhite) [18:36:40] \o/ elukey! 
thanks a lot for fixing it [18:37:32] (03CR) 10CDanis: [C: 03+1] grafana: remove id field from dashboard and add cooresponding uid [puppet] - 10https://gerrit.wikimedia.org/r/520039 (owner: 10Cwhite) [18:37:51] (03CR) 10Cwhite: [C: 03+2] grafana: remove id field from dashboard and add cooresponding uid [puppet] - 10https://gerrit.wikimedia.org/r/520039 (owner: 10Cwhite) [18:37:59] (03PS3) 10Cwhite: grafana: remove id field from dashboard and add cooresponding uid [puppet] - 10https://gerrit.wikimedia.org/r/520039 [18:43:24] (03CR) 10Bstorm: [C: 03+2] toolsforge: fix up the kubeadm config a bit more and configure docker [puppet] - 10https://gerrit.wikimedia.org/r/520052 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [18:43:31] (03PS2) 10Bstorm: toolsforge: fix up the kubeadm config a bit more and configure docker [puppet] - 10https://gerrit.wikimedia.org/r/520052 (https://phabricator.wikimedia.org/T215531) [18:45:23] (03PS1) 10ArielGlenn: defer start of dump run until July 2 [puppet] - 10https://gerrit.wikimedia.org/r/520053 [18:45:40] volans: \o/ [18:45:50] (03PS2) 10Urbanecm: Don't show cannot publish error to 'sysop' users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518260 (https://phabricator.wikimedia.org/T225398) (owner: 10Petar.petkovic) [18:46:07] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518260 (https://phabricator.wikimedia.org/T225398) (owner: 10Petar.petkovic) [18:46:10] (03PS2) 10ArielGlenn: defer start of dump run until July 2 [puppet] - 10https://gerrit.wikimedia.org/r/520053 [18:46:12] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Bstorm) private1, I believe. I mean the internal network. I *think* we are moving away from the lab private network, right @bd808 ? 
[18:46:12] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/CentralAuth: SWAT: [[:gerrit:519780|Require only one user group to allow publishing to main namespace]] (T225398) (duration: 00m 51s) [18:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:17] T225398: English Wikipedia shows the "cannot be published" warning, even though I would be able to publish - https://phabricator.wikimedia.org/T225398 [18:46:48] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10elukey) All logs (they were emitted only at the end): ` Mon Jul 1 17:27:36 2019 - INFO: No-installation mode selected... [18:47:02] (03CR) 10ArielGlenn: [C: 03+2] defer start of dump run until July 2 [puppet] - 10https://gerrit.wikimedia.org/r/520053 (owner: 10ArielGlenn) [18:47:06] (03Merged) 10jenkins-bot: Don't show cannot publish error to 'sysop' users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518260 (https://phabricator.wikimedia.org/T225398) (owner: 10Petar.petkovic) [18:48:15] (03PS1) 10Elukey: Introduce an-tool1006 [puppet] - 10https://gerrit.wikimedia.org/r/520055 (https://phabricator.wikimedia.org/T226844) [18:49:29] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SSWAT: [[:gerrit:518260|Dont show cannot publish error to sysop users]] (T225398) (duration: 00m 49s) [18:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:39] RoanKattouw, I'm done, thanks [18:50:23] Thanks, I'll deploy my patch now [18:51:06] (03PS3) 10Catrope: Enable GrowthExperiments homepage for 50% of new users on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519081 [18:51:55] RoanKattouw, ehh, I've made a wrong sync, CentralAuth instead of ContentTranslation :/. I guess I might run a scap sync-file while you're deploying? 
[18:52:05] Go for it, I haven't started yet [18:52:09] lmk when you're done [18:52:21] (03CR) 10Catrope: [C: 03+2] Enable GrowthExperiments homepage for 50% of new users on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519081 (owner: 10Catrope) [18:53:10] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/ContentTranslation: SWAT: [[:gerrit:519780|Require only one user group to allow publishing to main namespace]] (T225398) (duration: 00m 49s) [18:53:12] (03CR) 10jenkins-bot: Don't show cannot publish error to 'sysop' users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518260 (https://phabricator.wikimedia.org/T225398) (owner: 10Petar.petkovic) [18:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:16] T225398: English Wikipedia shows the "cannot be published" warning, even though I would be able to publish - https://phabricator.wikimedia.org/T225398 [18:53:18] (03Merged) 10jenkins-bot: Enable GrowthExperiments homepage for 50% of new users on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519081 (owner: 10Catrope) [18:53:22] RoanKattouw, now I should be really done :) [18:53:35] (03CR) 10jenkins-bot: Enable GrowthExperiments homepage for 50% of new users on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519081 (owner: 10Catrope) [19:04:56] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable Special:Homepage for 50% of new users on viwiki (duration: 00m 49s) [19:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:28] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
[19:22:51] 10Operations, 10ops-eqiad: rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10RobH) p:05Triage→03Normal [19:23:05] 10Operations, 10ops-eqiad: rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10RobH) [19:24:11] 10Operations, 10ops-eqiad: rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10RobH) @elukey: Please provide both hostnames and racking information for these nodes. Also include: ip/subnet info (private or public subnet), OS distro, and partitioning scheme, then assign... [19:24:51] (03PS1) 10Bstorm: toolforge: declare service for docker [puppet] - 10https://gerrit.wikimedia.org/r/520058 (https://phabricator.wikimedia.org/T215531) [19:25:50] (03CR) 10Bstorm: [C: 03+2] toolforge: declare service for docker [puppet] - 10https://gerrit.wikimedia.org/r/520058 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [19:30:17] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Papaul) @Bstorm Thanks. [19:30:24] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10Dzahn) regarding the very last line.. where it outputs the MAC address. In the original script i was thinking "maybe we... [19:31:56] !log removing nine files for legal compliance [19:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:24] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10bd808) >>! In T224528#5297520, @Bstorm wrote: > private1, I believe. I mean the internal network. I *think* we are moving away from the l... 
[19:35:35] (03CR) 10CDanis: Gerrit: Wait 5s before starting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520016 (owner: 10Paladox) [19:35:50] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [19:37:53] (03CR) 10Paladox: Gerrit: Wait 5s before starting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520016 (owner: 10Paladox) [19:41:44] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:49:54] (03PS1) 10Ottomata: Limit EventStreams per X-Client-IP connections [puppet] - 10https://gerrit.wikimedia.org/r/520068 (https://phabricator.wikimedia.org/T226808) [19:51:25] (03CR) 10CDanis: Gerrit: Wait 5s before starting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520016 (owner: 10Paladox) [19:56:51] 10Puppet, 10cloud-services-team (Kanban): Decouple core cloud puppetmaster from custom puppetmasters - https://phabricator.wikimedia.org/T227029 (10Andrew) [19:56:57] (03PS3) 10Paladox: Gerrit: Wait 5s before starting [puppet] - 10https://gerrit.wikimedia.org/r/520016 [19:57:27] (03PS4) 10CDanis: Gerrit: Wait 5s before starting [puppet] - 10https://gerrit.wikimedia.org/r/520016 (owner: 10Paladox) [19:57:41] (03CR) 10CDanis: [C: 03+2] Gerrit: Wait 5s before starting [puppet] - 10https://gerrit.wikimedia.org/r/520016 (owner: 10Paladox) [19:57:52] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [19:57:55] 10Operations, 10MediaWiki-Maintenance-scripts, 10Performance-Team: cron spam for slow queries on mwmaint /usr/local/bin/foreachwiki initSiteStats.php --update > /dev/null - https://phabricator.wikimedia.org/T216243 (10kchapman) a:03aaron [19:57:58] thank you cdanis! [20:00:04] cscott, arlolra, subbu, bearND, and halfak: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190701T2000). [20:01:46] (03CR) 10Jeena Huneidi: Give scaffold template configuration options for dev purposes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) (owner: 10Jeena Huneidi) [20:01:47] 10Puppet, 10cloud-services-team (Kanban): Prevent catalog breakage on cloud instances by decoupling core cloud puppetmaster from custom puppetmasters - https://phabricator.wikimedia.org/T227029 (10Andrew) [20:03:06] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [20:05:45] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10Volans) @elukey: yes because of the temporary suppression of cumin's default output, to allow each cookbook to decide wh... [20:15:18] !log milimetric@deploy1001 Started deploy [analytics/refinery@4e9894c]: minor, just removing hiwikisource from sqoop list [20:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:33] (03CR) 10Smalyshev: [C: 03+1] "I think it looks good now, should we proceed with testing/deploying it?" 
[puppet] - 10https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: 10ArielGlenn) [20:18:57] (03PS2) 10Ottomata: Limit EventStreams per X-Client-IP connections [puppet] - 10https://gerrit.wikimedia.org/r/520068 (https://phabricator.wikimedia.org/T226808) [20:19:55] (03CR) 10Smalyshev: [C: 03+1] Wikidata dumps: Update minimum expected sizes [puppet] - 10https://gerrit.wikimedia.org/r/519493 (https://phabricator.wikimedia.org/T226601) (owner: 10Hoo man) [20:20:47] 10Operations, 10Performance-Team, 10Traffic, 10media-storage, and 2 others: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Krinkle) [20:20:54] (03CR) 10Ottomata: [C: 03+2] "Looks good. https://puppet-compiler.wmflabs.org/compiler1002/17176/scb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/520068 (https://phabricator.wikimedia.org/T226808) (owner: 10Ottomata) [20:21:05] (03PS3) 10Ottomata: Limit EventStreams per X-Client-IP connections [puppet] - 10https://gerrit.wikimedia.org/r/520068 (https://phabricator.wikimedia.org/T226808) [20:23:36] PROBLEM - Disk space on notebook1003 is CRITICAL: DISK CRITICAL - free space: /srv 4181 MB (3% inode=86%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [20:23:44] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:24:09] (03PS1) 10Smalyshev: Enable RDF output for MediaInfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520078 (https://phabricator.wikimedia.org/T221916) [20:25:06] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [20:25:14] PROBLEM - etcd request latencies on neon is CRITICAL: 
instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:26:59] 10Operations, 10Performance-Team, 10Traffic: Investigate source of 404 Not Found responses from load.php - https://phabricator.wikimedia.org/T202479 (10Krinkle) 05Stalled→03Invalid [20:27:16] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: /srv 2198 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [20:27:23] looking [20:27:58] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [20:28:12] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={LIST,PATCH} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:28:18] 10Operations, 10DBA, 10Performance-Team, 10Availability (MediaWiki-MultiDC): Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 (10Krinkle) 05Open→03Stalled Blocked on {T175672} or {T196378}. 
[20:28:20] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:28:42] RECOVERY - Disk space on an-coord1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [20:28:58] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=get https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:32:17] !log milimetric@deploy1001 Finished deploy [analytics/refinery@4e9894c]: minor, just removing hiwikisource from sqoop list (duration: 16m 59s) [20:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:25] !log milimetric@deploy1001 Started deploy [analytics/refinery@4e9894c]: minor, just removing hiwikisource from sqoop list [20:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:44] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:32:51] 10Operations, 10Analytics, 10Patch-For-Review, 10Security, and 2 others: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Ottomata) Ok, all patches ready to go. Deployed in beta and looks good there. It is near the end of... [20:33:22] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:33:58] !log milimetric@deploy1001 Finished deploy [analytics/refinery@4e9894c]: minor, just removing hiwikisource from sqoop list (duration: 01m 33s) [20:34:00] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:04] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:35:16] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [20:35:26] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:40:08] 10Operations, 10observability, 10Goal, 10Patch-For-Review: Upgrade production prometheus-node-exporter to >= 0.16 - https://phabricator.wikimedia.org/T213708 (10CDanis) I think I just found another one that was missed: https://grafana.wikimedia.org/d/000000545/ganeti [20:40:18] (03PS2) 10Smalyshev: Enable RDF output for MediaInfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520078 (https://phabricator.wikimedia.org/T221916) [20:44:42] (03CR) 10ArielGlenn: "> I think it looks good now, should we proceed with testing/deploying" [puppet] - 10https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: 10ArielGlenn) [20:46:38] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [20:46:48] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [20:57:36] (03PS10) 10Volans: Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [21:00:04] bawolff and Reedy: Dear deployers, time to do the Weekly Security deployment window deploy. 
Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190701T2100). [21:02:24] awww [21:02:44] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [21:02:45] bawolff.. [21:02:56] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [21:03:05] (03CR) 10Volans: [C: 04-1] "Looks mostly ok, just very few small things and some nitpick inline." (0314 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [21:07:22] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [21:08:40] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [21:16:38] (03PS5) 10Jeena Huneidi: Give scaffold template configuration options for dev purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) [21:17:51] 10Operations, 10Continuous-Integration-Infrastructure, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Switch CI Docker Storage Driver to its own partition and to use devicemapper - https://phabricator.wikimedia.org/T178663 (10Dzahn) a:03Dzahn [21:18:00] (03PS6) 10Jeena Huneidi: Give scaffold template configuration options for dev purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 
(https://phabricator.wikimedia.org/T226660) [21:18:08] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) a:03Dzahn [21:19:21] (03CR) 10Jeena Huneidi: Give scaffold template configuration options for dev purposes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) (owner: 10Jeena Huneidi) [21:21:52] (03CR) 10Volans: "Looks ok, just a couple of suggestions inline." (034 comments) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/517605 (https://phabricator.wikimedia.org/T225945) (owner: 10Vgutierrez) [21:22:15] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: Degraded RAID on mw2250 - https://phabricator.wikimedia.org/T226948 (10jijiki) [21:32:58] (03PS1) 10ArielGlenn: start July 1 dump run at 10 pm utc [puppet] - 10https://gerrit.wikimedia.org/r/520124 [21:34:34] (03CR) 10ArielGlenn: [C: 03+2] start July 1 dump run at 10 pm utc [puppet] - 10https://gerrit.wikimedia.org/r/520124 (owner: 10ArielGlenn) [21:49:44] 10Puppet, 10cloud-services-team (Kanban): Prevent catalog breakage on cloud instances by decoupling core cloud puppetmaster from custom puppetmasters - https://phabricator.wikimedia.org/T227029 (10bd808) * What will this setup look like to the admins of a Cloud VPS instance? Is it all easy to find from /etc/pu... [22:05:21] 10Operations, 10Wikimedia-Site-requests: Global rename of Waldir → Waldyrious: supervision needed - https://phabricator.wikimedia.org/T225370 (10waldyrious) Thanks! Are there any cleanup changes I need to perform after this? I noticed for example that my watchlist RSS feeds became invalidated, and I needed to... [22:06:08] 10Operations, 10MobileFrontend, 10TechCom-RFC, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m.
subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10DIKW_Pyramid) @tstarling, this problem: > The m. subdomain, as in en.m.wikipedia.org, is... [22:07:05] (03PS1) 10CDanis: admin: rotate my yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/520130 [22:09:18] (03CR) 10CDanis: [C: 03+2] admin: rotate my yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/520130 (owner: 10CDanis) [22:11:46] PROBLEM - Host analytics1056 is DOWN: PING CRITICAL - Packet loss = 100% [22:16:56] 10Operations, 10vm-requests, 10cloud-services-team (Kanban): Three small ganeti VMs to host haproxy for OpenStack endpoints - https://phabricator.wikimedia.org/T227041 (10Andrew) [22:17:08] 10Operations, 10vm-requests, 10cloud-services-team (Kanban): Three small ganeti VMs to host haproxy for OpenStack endpoints - https://phabricator.wikimedia.org/T227041 (10Andrew) btw, I'm happy to actually set up the VMs, only assigning to Alex to approve the resource usage. [22:22:21] 10Operations, 10vm-requests, 10cloud-services-team (Kanban): Three small ganeti VMs to host haproxy for OpenStack endpoints - https://phabricator.wikimedia.org/T227041 (10Andrew) [22:24:43] 10Operations, 10Wikimedia-Site-requests: Global rename of Waldir → Waldyrious: supervision needed - https://phabricator.wikimedia.org/T225370 (10waldyrious) Also, is there any similar process for renaming my related account(s) on Wikitech, Gerrit, Toolforge, etc.? I don't mean to add on more work to this task...
[22:24:50] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 2546 MB (5% inode=62%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [22:25:20] ^ i took the tickets for that but i dont have access yet [22:25:27] we will add the new disks soon though [22:26:18] RECOVERY - Disk space on contint1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [22:35:27] 10Operations, 10vm-requests, 10cloud-services-team (Kanban): Three small ganeti VMs to host haproxy for OpenStack endpoints - https://phabricator.wikimedia.org/T227041 (10JHedden) I feel that 1 CPU might be too limiting, haproxy is multi-threaded and we'll have a number of backends defined. 2/2/20 would be... [22:46:08] 10Operations, 10MobileFrontend, 10TechCom-RFC, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10Koavf) Re: Wikitravel: Don't compare their custom HTML homepage in English. Compare (e.g.... [22:52:48] 10Operations, 10observability, 10Goal, 10Patch-For-Review: Upgrade production prometheus-node-exporter to >= 0.16 - https://phabricator.wikimedia.org/T213708 (10colewhite) @CDanis thanks for the heads up. Should look better now. [22:53:06] 10Puppet, 10cloud-services-team (Kanban): Prevent catalog breakage on cloud instances by decoupling core cloud puppetmaster from custom puppetmasters - https://phabricator.wikimedia.org/T227029 (10Andrew) >>! In T227029#5298427, @bd808 wrote: > * What will this setup look like to the admins of a Cloud VPS inst... [22:58:05] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for cloudbackup200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/520137 [23:00:04] MaxSem, RoanKattouw, and Niharika: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190701T2300). Please do the needful. 
[23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:20:01] 10Puppet, 10cloud-services-team (Kanban): Prevent catalog breakage on cloud instances by decoupling core cloud puppetmaster from custom puppetmasters - https://phabricator.wikimedia.org/T227029 (10bd808) >>! In T227029#5298596, @Andrew wrote: > puppet agent -tv --config /etc/basepuppet/puppet.conf > > and >... [23:28:39] 10Operations, 10Performance-Team, 10TechCom-RFC, 10Traffic, and 4 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Ladsgroup) Given that the last comment on this ticket was for around a year ago, I don't think it falls in category of "under discussio... [23:35:18] 10Operations, 10MobileFrontend, 10TechCom-RFC, 10Traffic, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10DIKW_Pyramid) @Koavf, ok I see, it work. But it produces ugly uri: https://wikitravel.or...