[00:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Evening SWAT (Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180104T0000). [00:00:04] Amir1: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:11] o/ [00:00:20] not testable, quick [00:06:07] bd808: so personal PCs and laptops like ours are affected... that makes 90% of the machines... [00:08:41] probably more than 90% [00:11:37] Hauskatze: I'd guess more like 99.99% of all things with a CPU [00:12:11] * bd808 buys stock in pen and paper manufacturers [00:12:30] typewriters and safes [00:12:37] and a shotgun under your desk [00:13:06] good night [00:15:14] RECOVERY - High CPU load on API appserver on mw1221 is OK: OK - load average: 6.30, 12.37, 23.52 [00:41:04] As I said elsewhere: good thing I use a laptop and not a desktop! [00:41:08] I'm safe! [00:41:37] true, not all computers are going to be using functionality where this is a huge deal [00:43:39] but stuff that runs untrusted VMs or JavaScript etc.... that's gonna be a huge percentage of things affected [00:49:19] 10Operations, 10Cleanup, 10Continuous-Integration-Config, 10Gerrit, and 6 others: Archive mediawiki/extensions/Collection and others - https://phabricator.wikimedia.org/T183891#3873797 (10Tgr) Archiving a stable extension should involve some amount of public discussion, not just someone making an arbitrary... [00:51:23] well, that's often pretty much the point why people use VMs... [00:53:10] some VMs are more trustworthy than others [00:55:45] there is a huge difference between VMs on hosts shared with anyone (e.g. 
public cloud providers, Labs, etc.), and the ganeti stuff running in prod where the guests are managed as production hosts [00:55:59] 10Operations, 10Cleanup, 10Continuous-Integration-Config, 10Gerrit, and 6 others: Archive mediawiki/extensions/Collection and others - https://phabricator.wikimedia.org/T183891#3866385 (10demon) Just because it isn't used at WMF anymore doesn't mean it's worth archiving. I'm inclined to deny this. [00:56:15] 10Operations, 10Cleanup, 10Continuous-Integration-Config, 10Gerrit, and 6 others: Archive mediawiki/extensions/Collection and others - https://phabricator.wikimedia.org/T183891#3873803 (10demon) (aka: what do the author(s) have to say?) [00:56:28] it's still not good of course [01:00:04] twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180104T0100). [01:00:04] No GERRIT patches in the queue for this window AFAICS. [01:00:10] 10Operations, 10ops-codfw, 10Cloud-VPS: Connect labtestvirt2003 eth1 and eth2 interface(s) to switch fabric - https://phabricator.wikimedia.org/T183167#3873806 (10Papaul) @RobH This was not on my dashboard so I missed it. I will get on it when back at the DC tomorrow. [01:39:59] reading the details of these vulns, spectre sounds even scarier than meltdown in a way, because it seems impossible to fix, and affects all processors [01:40:15] no_justification: no one deployed the SWAT, can we do it now? [01:41:00] Amir1: Do swat? I'm about to walk out the door.... 
[01:41:32] I can do it [01:41:39] just was thinking if it's okay [01:44:14] I don't see why not :) [01:44:17] All seems quiet [01:44:25] cool [01:44:29] (he says, as he walks away from any responsibility) [01:45:00] I think it should be okay, the patch is super straightforward [01:45:36] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398704 (https://phabricator.wikimedia.org/T182326) (owner: 10Ladsgroup) [01:47:02] (03Merged) 10jenkins-bot: Move testwiki2 from group0 to group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398704 (https://phabricator.wikimedia.org/T182326) (owner: 10Ladsgroup) [01:47:13] (03CR) 10jenkins-bot: Move testwiki2 from group0 to group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398704 (https://phabricator.wikimedia.org/T182326) (owner: 10Ladsgroup) [01:49:52] I get lots of things like this when deploying with scap [01:49:55] https://www.irccloud.com/pastebin/BFCq0qcL/ [01:50:05] !log ladsgroup@tin Synchronized dblists/group0.dblist: SWAT: Move testwiki2 from group0 to group1 (T182326) (duration: 01m 02s) [01:50:10] ignore it [01:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:50:17] T182326: Make one group1 wiki a client of testwikidata (preferably a test wiki) - https://phabricator.wikimedia.org/T182326 [01:51:20] okay [01:51:26] the deployment is done now [01:58:08] (03PS1) 10Dzahn: peopleweb: access based on roles, not host names [puppet] - 10https://gerrit.wikimedia.org/r/401829 [02:17:34] PROBLEM - MegaRAID on db1059 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [02:25:03] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.12) (duration: 07m 50s) [02:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:04] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1515033119 600 - REDIS 2.8.17 on 
127.0.0.1:6480 has 1 databases (db0) with 3115013 keys, up 4 minutes 9 seconds - replication_delay is 1515033119 [02:33:05] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3098559 keys, up 5 minutes 9 seconds - replication_delay is 0 [02:37:34] RECOVERY - MegaRAID on db1059 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [02:49:26] !log legoktm@tin Synchronized php-1.31.0-wmf.15/extensions/Flow/Hooks.php: Fix CheckUser type check thingy - T182834 (duration: 01m 01s) [02:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:49:37] T182834: Argument 1 passed to FlowHooks::onSpecialCheckUserGetLinksFromRow() must be an instance of CheckUser, SpecialCheckUser given - https://phabricator.wikimedia.org/T182834 [03:07:34] PROBLEM - MegaRAID on db1059 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [03:12:27] (03CR) 10Dzahn: [C: 031] "i would like to just merge it as is to be able to test it on my Cloud VPS project without having to use local puppetmaster, next i would l" [puppet] - 10https://gerrit.wikimedia.org/r/400100 (owner: 10Giuseppe Lavagetto) [03:16:54] PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [03:17:34] RECOVERY - MegaRAID on db1059 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [03:26:05] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 782.36 seconds [04:05:14] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 267.44 seconds [04:52:25] PROBLEM - HHVM rendering on mw2122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:53:15] RECOVERY - HHVM rendering on mw2122 is OK: HTTP OK: HTTP/1.1 200 OK - 73439 bytes in 0.323 second response time [05:17:34] PROBLEM - MegaRAID on db1059 is CRITICAL: 
CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [05:47:34] RECOVERY - MegaRAID on db1059 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [05:56:54] RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [05:57:35] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [05:57:45] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0 [06:00:54] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 [06:02:44] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 1.61 ms [06:17:34] PROBLEM - MegaRAID on db1059 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [06:22:54] 10Operations, 10ops-eqiad, 10DBA: db1059 possibly BBU issues - https://phabricator.wikimedia.org/T184160#3874338 (10Marostegui) [06:23:20] !log Issue a BBU re-learn cycle on db1059 - T184160 [06:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:33] T184160: db1059 possibly BBU issues - https://phabricator.wikimedia.org/T184160 [06:27:55] !log Deploy schema change on db1068 (s4) master - T174569 [06:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:08] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [06:33:10] 10Operations, 10ops-eqiad, 10DBA: db1059 possibly BBU issues - https://phabricator.wikimedia.org/T184160#3874356 (10Marostegui) ``` Time: Fri Nov 24 23:39:07 2017 Event Description: Battery started charging Time: Fri Nov 24 23:46:42 2017 Event Description: Battery charge complete Time: Sun Nov 26 08:04:47 20... 
[06:35:14] (03PS1) 10Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401899 (https://phabricator.wikimedia.org/T174569) [06:36:13] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401899 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:37:06] (03PS2) 10Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401899 (https://phabricator.wikimedia.org/T174569) [06:37:34] RECOVERY - MegaRAID on db1059 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [06:38:05] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401899 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:40:20] (03Abandoned) 10Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401899 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:41:15] 10Operations, 10ops-eqiad, 10DBA: db1059 possibly BBU issues - https://phabricator.wikimedia.org/T184160#3874359 (10Marostegui) After the manual relearn: ``` ˜/icinga-wm 7:37> RECOVERY - MegaRAID on db1059 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy ``` Don't know for how long it will last [06:42:42] (03PS1) 10Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401900 (https://phabricator.wikimedia.org/T174569) [06:43:00] 10Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3874361 (10Marostegui) This host failed again and recovered itself: ``` 03:16 < icinga-wm> PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, cu... 
[06:45:05] 10Operations, 10ops-eqiad, 10DBA: db1059 possibly BBU issues - https://phabricator.wikimedia.org/T184160#3874362 (10Marostegui) p:05Triage>03Normal [06:46:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401900 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:46:55] (03PS2) 10Giuseppe Lavagetto: site.pp: convert dns recursors to single role [puppet] - 10https://gerrit.wikimedia.org/r/401547 [06:47:20] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9532/" [puppet] - 10https://gerrit.wikimedia.org/r/401547 (owner: 10Giuseppe Lavagetto) [06:47:33] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401900 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:47:43] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401900 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:48:27] !log Deploy schema change on db1079 (s7) with replication enabled - this will generate lag on labs replicas - T174569 [06:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:39] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [06:48:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1079 - T174569 (duration: 01m 02s) [06:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:50] (03PS2) 10Giuseppe Lavagetto: bastionhost: add role for caching PoPs [puppet] - 10https://gerrit.wikimedia.org/r/401548 [07:03:48] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1113 and db1114 - https://phabricator.wikimedia.org/T182896#3874373 (10Marostegui) a:05Cmjohnson>03Marostegui [07:07:34] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old 
misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3874375 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1113.eqiad.wmnet', 'db111... [07:07:34] PROBLEM - MegaRAID on db1059 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [07:11:04] 10Operations, 10ops-eqiad, 10DBA: db1059 possibly BBU issues - https://phabricator.wikimedia.org/T184160#3874377 (10Marostegui) a:03Cmjohnson ``` PROBLEM - MegaRAID on db1059 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough ``` We should replace the BBU [07:13:22] (03PS2) 10ArielGlenn: add wmflabs config for dumps scap [dumps/scap] - 10https://gerrit.wikimedia.org/r/400598 [07:15:06] (03CR) 10ArielGlenn: "Yeah the dumpsgen user is included in the profiles that are applied via any of the snapshot roles." (031 comment) [dumps/scap] - 10https://gerrit.wikimedia.org/r/400598 (owner: 10ArielGlenn) [07:31:51] PROBLEM - Check the NTP synchronisation status of timesyncd on db1113 is CRITICAL: Return code of 255 is out of bounds [07:31:51] PROBLEM - DPKG on db1114 is CRITICAL: Return code of 255 is out of bounds [07:33:31] PROBLEM - DPKG on db1113 is CRITICAL: Return code of 255 is out of bounds [07:33:31] PROBLEM - Disk space on db1114 is CRITICAL: Return code of 255 is out of bounds [07:35:20] PROBLEM - Disk space on db1113 is CRITICAL: Return code of 255 is out of bounds [07:37:20] RECOVERY - Disk space on db1113 is OK: DISK OK [07:37:31] RECOVERY - DPKG on db1114 is OK: All packages OK [07:37:40] RECOVERY - Disk space on db1114 is OK: DISK OK [07:37:41] RECOVERY - DPKG on db1113 is OK: All packages OK [07:38:59] (03CR) 10ArielGlenn: [V: 032 C: 032] add wmflabs config for dumps scap [dumps/scap] - 10https://gerrit.wikimedia.org/r/400598 (owner: 10ArielGlenn) [07:43:41] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace 
all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3874390 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1113.eqiad.wmnet', 'db1114.eqiad.wmnet'] ``` and were **ALL** successful. [07:55:38] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3874396 (10Marostegui) I put the wrong task ID, it was meant to be T182896 Sorry! [07:56:26] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1113 and db1114 - https://phabricator.wikimedia.org/T182896#3837888 (10Marostegui) [08:01:54] RECOVERY - Check the NTP synchronisation status of timesyncd on db1113 is OK: OK: synced at Thu 2018-01-04 08:01:46 UTC. [08:01:58] (03PS2) 10ArielGlenn: add dumps repo source to beta scap, add snapshot to beta mw scap [puppet] - 10https://gerrit.wikimedia.org/r/400237 [08:02:08] (03CR) 10ArielGlenn: add dumps repo source to beta scap, add snapshot to beta mw scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/400237 (owner: 10ArielGlenn) [08:03:00] (03CR) 10ArielGlenn: [C: 032] add dumps repo source to beta scap, add snapshot to beta mw scap [puppet] - 10https://gerrit.wikimedia.org/r/400237 (owner: 10ArielGlenn) [08:22:57] (03PS1) 10Elukey: profile::hadoop::master: remove the last hadoop cdh auto-lookup [puppet] - 10https://gerrit.wikimedia.org/r/401904 (https://phabricator.wikimedia.org/T167790) [08:22:59] (03PS1) 10Marostegui: site.pp: Add db111{3,4} to spare [puppet] - 10https://gerrit.wikimedia.org/r/401905 (https://phabricator.wikimedia.org/T184161) [08:24:29] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1113 and db1114 - https://phabricator.wikimedia.org/T182896#3874449 (10Marostegui) 05Open>03Resolved [08:25:57] (03CR) 10Elukey: "pcc: https://puppet-compiler.wmflabs.org/compiler02/9538/" [puppet] - 10https://gerrit.wikimedia.org/r/401904 
(https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [08:26:42] (03PS3) 10Giuseppe Lavagetto: bastionhost: add role for caching PoPs [puppet] - 10https://gerrit.wikimedia.org/r/401548 [08:28:51] (03PS1) 10Elukey: role::analytics_cluster::coordinator: fix system::role [puppet] - 10https://gerrit.wikimedia.org/r/401907 (https://phabricator.wikimedia.org/T167790) [08:29:17] (03CR) 10Elukey: [C: 032] role::analytics_cluster::coordinator: fix system::role [puppet] - 10https://gerrit.wikimedia.org/r/401907 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [08:30:43] (03CR) 10Marostegui: [C: 032] site.pp: Add db111{3,4} to spare [puppet] - 10https://gerrit.wikimedia.org/r/401905 (https://phabricator.wikimedia.org/T184161) (owner: 10Marostegui) [08:30:48] (03PS2) 10Marostegui: site.pp: Add db111{3,4} to spare [puppet] - 10https://gerrit.wikimedia.org/r/401905 (https://phabricator.wikimedia.org/T184161) [08:31:21] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401909 [08:34:48] 10Operations, 10Performance-Team, 10HHVM, 10Patch-For-Review: HHVM hangs on the API cluster - https://phabricator.wikimedia.org/T184048#3874481 (10Joe) @Imarlier no I think there isn't much we can do until we have a reproduction case. For now I'm focusing on mitigations for this issue as 1) we're not on th... 
[08:43:43] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401909 (owner: 10Marostegui) [08:45:01] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401909 (owner: 10Marostegui) [08:45:49] (03PS4) 10Giuseppe Lavagetto: bastionhost: add role for caching PoPs [puppet] - 10https://gerrit.wikimedia.org/r/401548 [08:46:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1079 - T174569 (duration: 01m 02s) [08:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:29] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [08:46:39] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401909 (owner: 10Marostegui) [08:47:11] (03CR) 10Giuseppe Lavagetto: [C: 032] bastionhost: add role for caching PoPs [puppet] - 10https://gerrit.wikimedia.org/r/401548 (owner: 10Giuseppe Lavagetto) [08:48:55] !log Deploy schema change on db1069 (s7) - T174569 [08:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:55] PROBLEM - Check size of conntrack table on mw1336 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [08:53:04] !log Fixing inconsistencies on s7 - T163190 [08:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:16] T163190: Checksum data on s7 - https://phabricator.wikimedia.org/T163190 [08:54:05] PROBLEM - Check size of conntrack table on mw1337 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [08:55:29] so these are the new jobrunners [08:55:41] I bet that the race condition for the conntrack is wrong [08:55:41] fixing [08:56:05] RECOVERY - Check size of conntrack table on mw1337 is OK: OK: nf_conntrack is 78 % full [08:56:25] PROBLEM - puppet last run on ms-be1013 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdf1] [08:57:02] !log set sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=65 on mw133[67] (new jobrunners) [08:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:34] RECOVERY - MegaRAID on db1059 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [08:58:01] RECOVERY - Check size of conntrack table on mw1336 is OK: OK: nf_conntrack is 77 % full [08:58:44] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T184053#3874527 (10fgiunchedi) 05Open>03Resolved Thanks @Cmjohnson ! Disk is rebuilding. [08:59:22] (03CR) 10Gehel: [C: 031] "Puppet compiler looks good: https://puppet-compiler.wmflabs.org/compiler02/9540/" [puppet] - 10https://gerrit.wikimedia.org/r/399954 (https://phabricator.wikimedia.org/T178978) (owner: 10Smalyshev) [09:01:31] RECOVERY - puppet last run on ms-be1013 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:02:52] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-m4-master-00 consumer/mysql-eventbus [09:05:04] ah yes this is ok, downtime expired --^ [09:12:21] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1panelId=2fullscreen [09:13:22] PROBLEM - Apache HTTP on mw2125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:13:26] !log rebooting kubernetes1001 for kernel update [09:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:12] RECOVERY - Apache HTTP on mw2125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.128 second response time [09:15:21] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: 
Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1panelId=2fullscreen [09:17:46] (03PS1) 10Filippo Giunchedi: graphite: cleanup stale ORES metrics [puppet] - 10https://gerrit.wikimedia.org/r/401917 (https://phabricator.wikimedia.org/T169969) [09:18:08] (03CR) 10jerkins-bot: [V: 04-1] graphite: cleanup stale ORES metrics [puppet] - 10https://gerrit.wikimedia.org/r/401917 (https://phabricator.wikimedia.org/T169969) (owner: 10Filippo Giunchedi) [09:19:14] marostegui: I am checking https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors and the above mw exceptions are saying "Could not wait for replica DBs to catch up to db1062" - expected? [09:21:15] (03PS2) 10Filippo Giunchedi: graphite: cleanup stale ORES metrics [puppet] - 10https://gerrit.wikimedia.org/r/401917 (https://phabricator.wikimedia.org/T169969) [09:21:36] (03CR) 10jerkins-bot: [V: 04-1] graphite: cleanup stale ORES metrics [puppet] - 10https://gerrit.wikimedia.org/r/401917 (https://phabricator.wikimedia.org/T169969) (owner: 10Filippo Giunchedi) [09:22:53] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/9542/graphite1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/401917 (https://phabricator.wikimedia.org/T169969) (owner: 10Filippo Giunchedi) [09:25:02] elukey: checking [09:25:17] elukey: given that that is a master and has no replication, that is really bad [09:25:39] or is it happening on another host? [09:26:31] "Timed out waiting on db1101:3317" [09:26:47] is it down? 
[09:27:05] i can see it fine [09:27:16] https://logstash.wikimedia.org/goto/cc705e4aa21677c0e9a9ebce69235622 [09:27:36] it was a 10 min blip afaics [09:27:46] now the exceptions have fully recovered [09:27:51] Server db1101:3317 has 63.380876064301 seconds of lag [09:29:00] That could have been me, while fixing inconsistencies [09:29:06] I am checking the graphs, and it matches the times [09:29:32] I think I will prewarm the tables and throttle it a bit [09:29:51] PROBLEM - SSH on ms-be1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:29:51] it seems it was a large replace [09:29:52] Because yesterday I saw no issues on other hosts, but these ones are multi-instance and have less buffer pool, so it could be that [09:30:00] jynus: yep, that was me [09:30:31] I do not think it was performance [09:30:35] but locking [09:30:55] see the threads running stats [09:31:41] RECOVERY - SSH on ms-be1013 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u1 (protocol 2.0) [09:31:47] could be, yeah, I will throttle it more then [09:31:51] PROBLEM - very high load average likely xfs on ms-be1013 is CRITICAL: CRITICAL - load average: 112.76, 105.24, 68.57 [09:32:39] in theory, one replica having issues should not affect mediawiki, in practice, because of that ticket, it does :-( [09:33:23] actually, I am going to depool it, it will be easier [09:34:51] RECOVERY - very high load average likely xfs on ms-be1013 is OK: OK - load average: 52.71, 79.99, 65.11 [09:35:16] (03PS1) 10Marostegui: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401925 (https://phabricator.wikimedia.org/T163190) [09:38:28] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401925 (https://phabricator.wikimedia.org/T163190) (owner: 10Marostegui) [09:38:30] !log rebooting mw1307 and wtp1025 for kernel update [09:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:49] 
(03Merged) 10jenkins-bot: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401925 (https://phabricator.wikimedia.org/T163190) (owner: 10Marostegui) [09:39:59] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401925 (https://phabricator.wikimedia.org/T163190) (owner: 10Marostegui) [09:43:33] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1101:3317 - T163190 (duration: 03m 09s) [09:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:45] T163190: Checksum data on s7 - https://phabricator.wikimedia.org/T163190 [09:46:39] 10Operations, 10Continuous-Integration-Config: tox 2.5.0 on phabricator-jessie-diffs fails with ERROR: Commands not specified - https://phabricator.wikimedia.org/T184060#3874550 (10fgiunchedi) My point is more like that's a regression in tox 2.5.0 (i.e. environment without `commands` is invalid) that got rever... [09:50:04] 10Operations, 10Patch-For-Review, 10Prometheus-metrics-monitoring, 10User-fgiunchedi: Move deployment-prep redis instances to stretch - https://phabricator.wikimedia.org/T179371#3874565 (10fgiunchedi) So IIRC @Pchelolo has finished running their tests in deployment-prep that used redis. So we could actuall... 
[09:54:18] (03PS2) 10Giuseppe Lavagetto: site.pp: use role keyword for striker::web only on californium [puppet] - 10https://gerrit.wikimedia.org/r/401549 [09:58:35] !log restart and upgrade db2053 [09:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:22] (03PS3) 10Giuseppe Lavagetto: site.pp: use role keyword for striker::web only on californium [puppet] - 10https://gerrit.wikimedia.org/r/401549 [10:04:35] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9544/" [puppet] - 10https://gerrit.wikimedia.org/r/401549 (owner: 10Giuseppe Lavagetto) [10:09:17] (03PS1) 10Elukey: Refactor thorium's roles in one [puppet] - 10https://gerrit.wikimedia.org/r/401927 (https://phabricator.wikimedia.org/T167790) [10:14:14] !log mobrovac@tin Started deploy [mathoid/deploy@7f664ff]: Update Mathoid in codfw to v0.7.0, take #2 - T183557 [10:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:26] T183557: Mathoid v0.7.0 not accepting chem formula - https://phabricator.wikimedia.org/T183557 [10:14:35] (03PS2) 10Elukey: Refactor thorium's roles in one [puppet] - 10https://gerrit.wikimedia.org/r/401927 (https://phabricator.wikimedia.org/T167790) [10:16:52] !log mobrovac@tin Finished deploy [mathoid/deploy@7f664ff]: Update Mathoid in codfw to v0.7.0, take #2 - T183557 (duration: 02m 38s) [10:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:13] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/9546/thorium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/401927 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [10:20:00] (03PS1) 10Jcrespo: mariadb: Move db2053 socket to /run [puppet] - 10https://gerrit.wikimedia.org/r/401928 (https://phabricator.wikimedia.org/T148507) [10:20:32] (03CR) 10Jcrespo: [C: 032] mariadb: Move db2053 socket to /run [puppet] - 10https://gerrit.wikimedia.org/r/401928 
(https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [10:21:26] (03PS1) 10Alexandros Kosiaris: Disable docker bridge in production/staging [puppet] - 10https://gerrit.wikimedia.org/r/401929 [10:21:58] (03PS3) 10Elukey: Refactor thorium's roles in one [puppet] - 10https://gerrit.wikimedia.org/r/401927 (https://phabricator.wikimedia.org/T167790) [10:25:30] (03PS1) 10Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401930 (https://phabricator.wikimedia.org/T163190) [10:29:21] (03PS2) 10Giuseppe Lavagetto: cache: add ipsec to basic roles [puppet] - 10https://gerrit.wikimedia.org/r/401550 [10:29:53] (03CR) 10Giuseppe Lavagetto: [C: 031] Disable docker bridge in production/staging [puppet] - 10https://gerrit.wikimedia.org/r/401929 (owner: 10Alexandros Kosiaris) [10:31:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401930 (https://phabricator.wikimedia.org/T163190) (owner: 10Marostegui) [10:37:37] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401930 (https://phabricator.wikimedia.org/T163190) (owner: 10Marostegui) [10:39:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1079 - T163190 (duration: 01m 02s) [10:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:16] T163190: Checksum data on s7 - https://phabricator.wikimedia.org/T163190 [10:39:17] !log Stop replication in sync on db1079 and db1101:3317 - T163190 [10:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:04] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler03/9550/ shows this is a noop." 
[puppet] - 10https://gerrit.wikimedia.org/r/401550 (owner: 10Giuseppe Lavagetto) [10:40:38] (03CR) 10Ema: [C: 031] cache: add ipsec to basic roles [puppet] - 10https://gerrit.wikimedia.org/r/401550 (owner: 10Giuseppe Lavagetto) [10:44:57] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401935 [10:46:46] (03PS9) 10Ema: mtail: add program to count varnish backend metrics [puppet] - 10https://gerrit.wikimedia.org/r/401535 (https://phabricator.wikimedia.org/T177199) (owner: 10Filippo Giunchedi) [10:47:47] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401935 (owner: 10Marostegui) [10:50:30] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401935 (owner: 10Marostegui) [10:51:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1079 - T163190 (duration: 01m 01s) [10:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:58] T163190: Checksum data on s7 - https://phabricator.wikimedia.org/T163190 [10:55:25] (03PS1) 10Alexandros Kosiaris: ifguard $realm and $cluster with defined() [puppet] - 10https://gerrit.wikimedia.org/r/401974 [10:57:01] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401930 (https://phabricator.wikimedia.org/T163190) (owner: 10Marostegui) [10:57:05] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401935 (owner: 10Marostegui) [10:57:42] (03PS3) 10Giuseppe Lavagetto: cache: add ipsec to basic roles [puppet] - 10https://gerrit.wikimedia.org/r/401550 [11:00:46] (03CR) 10Giuseppe Lavagetto: [C: 032] cache: add ipsec to basic roles [puppet] - 10https://gerrit.wikimedia.org/r/401550 (owner: 10Giuseppe Lavagetto) [11:06:21] PROBLEM - HHVM rendering on mw2107 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [11:07:11] RECOVERY - HHVM rendering on mw2107 is OK: HTTP OK: HTTP/1.1 200 OK - 73425 bytes in 0.317 second response time [11:10:00] (03PS7) 10Ema: varnish: add varnishmtail instance for varnish backends [puppet] - 10https://gerrit.wikimedia.org/r/401526 (https://phabricator.wikimedia.org/T177199) [11:10:10] (03CR) 10Ema: [V: 032 C: 032] varnish: add varnishmtail instance for varnish backends [puppet] - 10https://gerrit.wikimedia.org/r/401526 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [11:11:31] (03CR) 10Ema: [C: 032] mtail: add program to count varnish backend metrics [puppet] - 10https://gerrit.wikimedia.org/r/401535 (https://phabricator.wikimedia.org/T177199) (owner: 10Filippo Giunchedi) [11:11:37] (03PS10) 10Ema: mtail: add program to count varnish backend metrics [puppet] - 10https://gerrit.wikimedia.org/r/401535 (https://phabricator.wikimedia.org/T177199) (owner: 10Filippo Giunchedi) [11:11:45] (03CR) 10Ema: [V: 032 C: 032] mtail: add program to count varnish backend metrics [puppet] - 10https://gerrit.wikimedia.org/r/401535 (https://phabricator.wikimedia.org/T177199) (owner: 10Filippo Giunchedi) [11:12:56] <_joe_> ema: I'll reenable puppet everywhere shortly [11:13:05] _joe_: thanks [11:13:35] <_joe_> ema: not sure everything works as expected though [11:13:48] <_joe_> ema: puppet has run on cp1048 and now on cp1052 [11:13:54] <_joe_> with your changes [11:14:06] _joe_: let's see [11:14:10] <_joe_> do you want to check those hosts before reenabling? [11:14:44] _joe_: just checked, the change worked fine [11:14:52] CC: godog [11:14:55] <_joe_> nevermind, it was a partial merge apparently with my first puppet run [11:15:22] <_joe_> ema: reenabling then :) [11:15:35] yes, please! 
[11:15:45] <_joe_> {{done}} [11:16:29] (03PS2) 10Giuseppe Lavagetto: site.pp: simplify role() keyword call for cache::canary [puppet] - 10https://gerrit.wikimedia.org/r/401551 [11:17:00] (03CR) 10Giuseppe Lavagetto: [C: 032] site.pp: simplify role() keyword call for cache::canary [puppet] - 10https://gerrit.wikimedia.org/r/401551 (owner: 10Giuseppe Lavagetto) [11:17:47] <_joe_> ema: do you think puppet can be reenabled on cp1008? [11:18:21] <_joe_> if not, that's ok, but it's disabled since forever [11:18:54] _joe_: I'll check in a second [11:19:44] (03PS2) 10Giuseppe Lavagetto: site.pp: one role for dbstore2001.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/401552 [11:20:26] (03PS1) 10Elukey: role::prometheus::analytics: add configuration for jmx hadoop agents [puppet] - 10https://gerrit.wikimedia.org/r/402021 (https://phabricator.wikimedia.org/T177458) [11:21:33] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler03/9552/" [puppet] - 10https://gerrit.wikimedia.org/r/401552 (owner: 10Giuseppe Lavagetto) [11:25:00] (03PS2) 10Giuseppe Lavagetto: monitoring: create role::alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/401553 [11:30:16] (03CR) 10Filippo Giunchedi: [C: 031] role::prometheus::analytics: add configuration for jmx hadoop agents [puppet] - 10https://gerrit.wikimedia.org/r/402021 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [11:30:47] \o/ [11:30:54] (03CR) 10Giuseppe Lavagetto: [C: 032] monitoring: create role::alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/401553 (owner: 10Giuseppe Lavagetto) [11:31:36] (03PS1) 10Ema: mtail: update varnishbackend.mtail regex [puppet] - 10https://gerrit.wikimedia.org/r/402022 (https://phabricator.wikimedia.org/T177199) [11:32:01] PROBLEM - Check systemd state on kubernetes1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:36:53] (03PS2) 10Ema: mtail: update varnishbackend.mtail regex [puppet] - 10https://gerrit.wikimedia.org/r/402022 (https://phabricator.wikimedia.org/T177199) [11:38:46] _joe_: puppet re-enabled on cp1008, it was disabled for digicert certs testing a while ago [11:41:01] RECOVERY - Check systemd state on kubernetes1003 is OK: OK - running: The system is fully operational [11:42:12] (03PS3) 10Ema: mtail: update varnishbackend.mtail regex [puppet] - 10https://gerrit.wikimedia.org/r/402022 (https://phabricator.wikimedia.org/T177199) [11:43:07] (03CR) 10Filippo Giunchedi: [C: 031] mtail: update varnishbackend.mtail regex [puppet] - 10https://gerrit.wikimedia.org/r/402022 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [11:43:48] (03CR) 10Ema: [C: 032] mtail: update varnishbackend.mtail regex [puppet] - 10https://gerrit.wikimedia.org/r/402022 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [11:51:46] (03PS2) 10Giuseppe Lavagetto: eventlogging: create compound role, consolidate hiera [puppet] - 10https://gerrit.wikimedia.org/r/401554 [11:55:21] (03CR) 10Giuseppe Lavagetto: eventlogging: create compound role, consolidate hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/401554 (owner: 10Giuseppe Lavagetto) [11:55:27] (03CR) 10Giuseppe Lavagetto: [C: 032] eventlogging: create compound role, consolidate hiera [puppet] - 10https://gerrit.wikimedia.org/r/401554 (owner: 10Giuseppe Lavagetto) [11:58:57] (03PS1) 10Filippo Giunchedi: smart: bump timeout to 60s [puppet] - 10https://gerrit.wikimedia.org/r/402023 (https://phabricator.wikimedia.org/T86552) [11:58:59] (03PS1) 10Filippo Giunchedi: smart: ignore drbd disks [puppet] - 10https://gerrit.wikimedia.org/r/402024 (https://phabricator.wikimedia.org/T86552) [12:00:13] !log upgrading HHVM on API canaries (mw1276-mw1279) to HHVM 3.18.6 [12:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:29] (03CR) 10Filippo Giunchedi: [C: 032] smart: bump timeout to 
60s [puppet] - 10https://gerrit.wikimedia.org/r/402023 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [12:00:41] (03PS2) 10Filippo Giunchedi: smart: bump timeout to 60s [puppet] - 10https://gerrit.wikimedia.org/r/402023 (https://phabricator.wikimedia.org/T86552) [12:02:31] !log mobrovac@tin Started deploy [mathoid/deploy@c9957ce]: Mathoid v0.7.1 - T172767 [12:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:43] T172767: Prepare mathoid 0.7 release (tracking) - https://phabricator.wikimedia.org/T172767 [12:03:00] (03CR) 10Filippo Giunchedi: [C: 032] smart: ignore drbd disks [puppet] - 10https://gerrit.wikimedia.org/r/402024 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [12:03:04] (03PS2) 10Filippo Giunchedi: smart: ignore drbd disks [puppet] - 10https://gerrit.wikimedia.org/r/402024 (https://phabricator.wikimedia.org/T86552) [12:06:11] PROBLEM - Host kubernetes1003 is DOWN: PING CRITICAL - Packet loss = 100% [12:06:21] RECOVERY - Host kubernetes1003 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [12:07:36] !log mobrovac@tin Finished deploy [mathoid/deploy@c9957ce]: Mathoid v0.7.1 - T172767 (duration: 05m 05s) [12:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:49] T172767: Prepare mathoid 0.7 release (tracking) - https://phabricator.wikimedia.org/T172767 [12:10:08] 10Operations, 10Developer-Relations, 10cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org (pilot instance) - https://phabricator.wikimedia.org/T180854#3874954 (10Qgil) @Andrew @Austin @EBernhardson @Tgr @Samwilson @yuvipanda, as current admins of [[ https://tools.wmflabs.org/openstack-brows... [12:12:51] (03PS1) 10ArielGlenn: use strict var syntax in snapshot/dumps modules [puppet] - 10https://gerrit.wikimedia.org/r/402029 [12:13:21] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [12:17:08] (03PS5) 10Volans: PuppetDB backend: add support for API v4 [software/cumin] - 10https://gerrit.wikimedia.org/r/399821 (https://phabricator.wikimedia.org/T182575) [12:20:43] * volans looking at icinga-wm [12:20:53] volans: it's me [12:21:01] akosiaris: ack :) [12:24:30] (03CR) 10Volans: "Tested on a local docker deployment of PuppetDB using:" [software/cumin] - 10https://gerrit.wikimedia.org/r/399821 (https://phabricator.wikimedia.org/T182575) (owner: 10Volans) [12:28:10] (03PS2) 10Elukey: role::prometheus::analytics: add configuration for jmx hadoop agents [puppet] - 10https://gerrit.wikimedia.org/r/402021 (https://phabricator.wikimedia.org/T177458) [12:28:36] Reedy: I think I've finished with https://phabricator.wikimedia.org/P6522 -- can we run the script again and check if there's anything left? [12:29:28] (03CR) 10Elukey: [C: 032] role::prometheus::analytics: add configuration for jmx hadoop agents [puppet] - 10https://gerrit.wikimedia.org/r/402021 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [12:30:00] CUSTOM - Host kubernetes1003 is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [12:30:44] CUSTOM - Host kubernetes1003 is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [12:31:49] CUSTOM - Host kubernetes1003 is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [12:32:06] I like how custom notifications require a text but are not reporting it [12:32:23] they usually do, at least for services... [12:32:40] I've used them, at least for services and CRITICAL they do [12:34:19] CUSTOM - Host kubernetes1003 is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [12:36:01] (03PS2) 10ArielGlenn: use strict var syntax in snapshot/dumps modules [puppet] - 10https://gerrit.wikimedia.org/r/402029 [12:37:57] what on earth is this software doing... 
[12:38:21] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:40:07] (03PS3) 10ArielGlenn: use strict var syntax in snapshot/dumps modules [puppet] - 10https://gerrit.wikimedia.org/r/402029 [12:40:45] (03CR) 10ArielGlenn: [C: 032] use strict var syntax in snapshot/dumps modules [puppet] - 10https://gerrit.wikimedia.org/r/402029 (owner: 10ArielGlenn) [12:40:50] legoktm: around? [12:41:33] !log mobrovac@tin Started deploy [restbase/deploy@66b7efe]: Switch Mathoid to Cassandra 3 and drop Cassandra 2 references - T179419 [12:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:45] T179419: Migrate mathoid storage from legacy to new strategy - https://phabricator.wikimedia.org/T179419 [12:45:38] !log mobrovac@tin Finished deploy [restbase/deploy@66b7efe]: Switch Mathoid to Cassandra 3 and drop Cassandra 2 references - T179419 (duration: 04m 05s) [12:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:23] 10Operations, 10RESTBase, 10Patch-For-Review, 10Services (blocked), 10User-mobrovac: Set up RESTBase on Cassandra 3 nodes - https://phabricator.wikimedia.org/T184110#3875011 (10mobrovac) [12:49:21] 10Operations, 10RESTBase, 10Patch-For-Review, 10Services (doing), 10User-mobrovac: Set up RESTBase on Cassandra 3 nodes - https://phabricator.wikimedia.org/T184110#3872628 (10mobrovac) [12:53:49] (03CR) 10Jcrespo: [C: 031] "It requires a sanitarium restart- but it is not high priority- x1 tables should not reach labsdbs anyway." 
[puppet] - 10https://gerrit.wikimedia.org/r/397623 (owner: 10Gergő Tisza) [12:53:52] !log upgrading HHVM on mwdebug* to 3.18.6 [12:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:06] (03CR) 10Jcrespo: "If this intends to run every day at 2:42, I do not think it will work- but I do not know which is the intended schedule (not shown on the " [puppet] - 10https://gerrit.wikimedia.org/r/395694 (https://phabricator.wikimedia.org/T181107) (owner: 10Gergő Tisza) [13:09:59] (03PS1) 10Steinsplitter: Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow localizaion. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402040 [13:12:58] (03PS2) 10Steinsplitter: Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow onwiki localizaion of a commosn specific notice. All changes have been made onwiki yet. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402040 (https://phabricator.wikimedia.org/T183848) [13:17:55] !log upgrading HHVM on mw1180-mw1220 to 3.18.6 [13:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:27] (03PS9) 10Elukey: role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 [13:28:54] 10Operations, 10MediaWiki-Maintenance-scripts, 10Wikidata: Missing references to s8 on maintenance and cloud scripts (and potentially others) - https://phabricator.wikimedia.org/T184179#3875081 (10jcrespo) p:05Triage>03Normal [13:28:57] (03CR) 10jerkins-bot: [V: 04-1] role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey) [13:30:19] argh [13:30:30] (03PS10) 10Elukey: role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 [13:30:58] (03CR) 10jerkins-bot: [V: 04-1] role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 
10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey) [13:30:59] ah yes new violation, expected [13:31:14] (03PS1) 10Marostegui: Revert "Revert "db-eqiad.php: Depool db1079"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402043 [13:33:13] (03CR) 10Marostegui: [C: 032] Revert "Revert "db-eqiad.php: Depool db1079"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402043 (owner: 10Marostegui) [13:34:20] 10Operations, 10Data-Services, 10MediaWiki-Maintenance-scripts, 10Wikidata: Missing references to s8 on maintenance and cloud scripts (and potentially others) - https://phabricator.wikimedia.org/T184179#3875094 (10jcrespo) @bd808 @Andrew This contains the self-actionable part of T181643 (mostly dns-related... [13:34:42] (03Merged) 10jenkins-bot: Revert "Revert "db-eqiad.php: Depool db1079"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402043 (owner: 10Marostegui) [13:35:01] PROBLEM - very high load average likely xfs on ms-be1013 is CRITICAL: CRITICAL - load average: 147.44, 129.70, 119.13 [13:35:55] !log Stop replication in sync db1079 db1101:3317 T163190 [13:36:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1079 - T163190 (duration: 01m 02s) [13:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:05] T163190: Checksum data on s7 - https://phabricator.wikimedia.org/T163190 [13:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:26] (03PS1) 10Marostegui: Revert "Revert "Revert "db-eqiad.php: Depool db1079""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402045 [13:36:44] (03CR) 10jenkins-bot: Revert "Revert "db-eqiad.php: Depool db1079"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402043 (owner: 10Marostegui) [13:40:32] (03PS1) 10Jcrespo: mediawiki-maintenance: Run maintenance on new s8 replica set, too [puppet] - 10https://gerrit.wikimedia.org/r/402047 (https://phabricator.wikimedia.org/T184179) [13:40:38] (03CR) 10Marostegui: 
[C: 032] Revert "Revert "Revert "db-eqiad.php: Depool db1079""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402045 (owner: 10Marostegui) [13:41:31] PROBLEM - SSH on ms-be1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:06] (03Merged) 10jenkins-bot: Revert "Revert "Revert "db-eqiad.php: Depool db1079""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402045 (owner: 10Marostegui) [13:42:08] (03CR) 10Jcrespo: [C: 04-1] "Requires s8.dblist/s5.dblist update and potentially noc source code update, too." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401436 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [13:42:19] (03CR) 10jenkins-bot: Revert "Revert "Revert "db-eqiad.php: Depool db1079""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402045 (owner: 10Marostegui) [13:43:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1079 - T163190 (duration: 01m 01s) [13:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:09] T163190: Checksum data on s7 - https://phabricator.wikimedia.org/T163190 [13:45:21] RECOVERY - SSH on ms-be1013 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u1 (protocol 2.0) [13:47:40] 10Operations, 10Data-Services, 10MediaWiki-Maintenance-scripts, 10Wikidata, 10Patch-For-Review: Missing references to s8 on maintenance and cloud scripts (and potentially others) - https://phabricator.wikimedia.org/T184179#3875127 (10jcrespo) ^see if that patch makes sense [13:48:48] 10Operations, 10Data-Services, 10MediaWiki-Maintenance-scripts, 10Wikidata, 10Patch-For-Review: Missing references to s8 on maintenance and cloud scripts (and potentially others) - https://phabricator.wikimedia.org/T184179#3875129 (10mark) p:05Normal>03High [13:58:13] (03PS5) 10EddieGP: Restrict sending mails to new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397768 (https://phabricator.wikimedia.org/T182541) [13:59:21] PROBLEM - DPKG on mw1209 is CRITICAL: DPKG 
CRITICAL dpkg reports broken packages [13:59:53] (03PS4) 10Marostegui: db-eqiad.php: Point wikidatawiki to s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401436 (https://phabricator.wikimedia.org/T177208) [14:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180104T1400). [14:00:04] eddiegp and Steinsplitter: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:21] RECOVERY - DPKG on mw1209 is OK: All packages OK [14:00:26] * eddiegp is here [14:00:33] * Steinsplitter waves [14:00:38] I can SWAT. [14:01:03] o/ [14:01:11] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Point wikidatawiki to s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401436 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [14:01:14] (03CR) 10ArielGlenn: [C: 031] "aAs far as overlap with dumps usage of vslow, it's no better or worse than previous usage, which has gotten to be kind of crappy over time" [puppet] - 10https://gerrit.wikimedia.org/r/402047 (https://phabricator.wikimedia.org/T184179) (owner: 10Jcrespo) [14:01:32] PROBLEM - SSH on ms-be1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:01:34] (03PS3) 10Steinsplitter: Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow onwiki localizaion of a commosn specific notice. All changes have been made onwiki yet. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402040 (https://phabricator.wikimedia.org/T183848) [14:02:54] (03CR) 10Niharika29: [C: 032] "SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/397768 (https://phabricator.wikimedia.org/T182541) (owner: 10EddieGP) [14:03:05] (03CR) 10Luke081515: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402040 (https://phabricator.wikimedia.org/T183848) (owner: 10Steinsplitter) [14:03:13] (03CR) 10Luke081515: [C: 031] Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow onwiki localizaion of a commosn specific notice. All changes have been made onw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402040 (https://phabricator.wikimedia.org/T183848) (owner: 10Steinsplitter) [14:04:14] (03Merged) 10jenkins-bot: Restrict sending mails to new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397768 (https://phabricator.wikimedia.org/T182541) (owner: 10EddieGP) [14:06:52] (03CR) 10jenkins-bot: Restrict sending mails to new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397768 (https://phabricator.wikimedia.org/T182541) (owner: 10EddieGP) [14:07:13] eddiegp: Can you test your change? It's on mwdebug1002. [14:07:24] Niharika: doing [14:07:42] eddiegp: Wait a second. scap is taking too long. [14:08:11] PROBLEM - very high load average likely xfs on ms-be1013 is CRITICAL: CRITICAL - load average: 129.54, 140.99, 127.14 [14:08:31] RECOVERY - SSH on ms-be1013 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u1 (protocol 2.0) [14:08:55] Still waiting... [14:09:41] eddiegp: Done. [14:09:44] Niharika: It's already working though :) [14:09:53] eddiegp: Okay, great! 
[14:10:56] (03CR) 10Jcrespo: "I think the issue is that you created a file, not a ../../../ link" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401436 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [14:11:41] PROBLEM - SSH on ms-be1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:11:43] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Restrict sending mails to new users T182541 (duration: 01m 02s) [14:11:52] eddiegp: Synced^ [14:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:55] T182541: Update Wikimedia configuration to prevent some users from sending emails - https://phabricator.wikimedia.org/T182541 [14:12:00] (03PS4) 10Niharika29: Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow onwiki localizaion of a commosn specific notice. All changes have been made onwiki yet. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402040 (https://phabricator.wikimedia.org/T183848) (owner: 10Steinsplitter) [14:12:10] Niharika: Thanks :) [14:12:34] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9551/ says it's fine with a PCC across the fleet. Also a quick check in labs was fine as we" [puppet] - 10https://gerrit.wikimedia.org/r/401974 (owner: 10Alexandros Kosiaris) [14:12:49] (03PS2) 10Alexandros Kosiaris: ifguard $realm and $cluster with defined() [puppet] - 10https://gerrit.wikimedia.org/r/401974 [14:13:02] (03CR) 10Niharika29: [C: 032] "SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/402040 (https://phabricator.wikimedia.org/T183848) (owner: 10Steinsplitter) [14:13:42] RECOVERY - SSH on ms-be1013 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u1 (protocol 2.0) [14:14:12] PROBLEM - very high load average likely xfs on ms-be1013 is CRITICAL: CRITICAL - load average: 101.68, 118.30, 121.21 [14:14:21] (03Merged) 10jenkins-bot: Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow onwiki localizaion of a commosn specific notice. All changes have been made onwiki yet. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402040 (https://phabricator.wikimedia.org/T183848) (owner: 10Steinsplitter) [14:15:04] Steinsplitter: Yours is on mwdebug1002 as well. [14:15:04] (03PS5) 10Marostegui: db-eqiad.php: Point wikidatawiki to s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401436 (https://phabricator.wikimedia.org/T177208) [14:15:17] Niharika: thanks will test [14:16:11] (03CR) 10Jcrespo: "Hey, I asked to upgrade the dump hosts! And dumps should be faster on dedicated hardware, probably?" [puppet] - 10https://gerrit.wikimedia.org/r/402047 (https://phabricator.wikimedia.org/T184179) (owner: 10Jcrespo) [14:16:56] (03CR) 10jenkins-bot: Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow onwiki localizaion of a commosn specific notice. All changes have been made onwiki yet. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402040 (https://phabricator.wikimedia.org/T183848) (owner: 10Steinsplitter) [14:17:55] (03CR) 10ArielGlenn: [C: 031] "They run slower due to other concerns, such as wikidata eating us alive, nothing for your todo list ;-)" [puppet] - 10https://gerrit.wikimedia.org/r/402047 (https://phabricator.wikimedia.org/T184179) (owner: 10Jcrespo) [14:19:08] Niharika: it is synchronized yet? [14:19:19] Steinsplitter: Yep. [14:19:45] perfect, thanks. [14:20:05] Steinsplitter: Oh you mean synchornised everywhere? [14:20:11] No, I was waiting on you testing it. 
[14:20:20] synchronized* [14:20:51] one sec. [14:21:58] (03CR) 10Jcrespo: [C: 031] "That should be it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401436 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [14:22:05] Niharika: works. [14:22:18] Alright, it's going live then. [14:24:24] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Adding Movepage-summary to wgForceUIMsgAsContentMsg T183848 (duration: 01m 02s) [14:24:31] Steinsplitter: Done. It should be out everywhere now. [14:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:38] T183848: MediaWiki:Movepage-summary is not forced to content language - https://phabricator.wikimedia.org/T183848 [14:25:38] I guess that's all for SWAT today. [14:25:45] Niharika> thanks! [14:25:51] PROBLEM - SSH on ms-be1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:11] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 24 probes of 291 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:27:24] (03PS1) 10Elukey: profile::hadoop: set hiera defaults to ease labs deployments [puppet] - 10https://gerrit.wikimedia.org/r/402050 (https://phabricator.wikimedia.org/T167790) [14:28:41] (03CR) 10Ottomata: [C: 031] profile::hadoop: set hiera defaults to ease labs deployments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/402050 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [14:32:12] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 10 probes of 291 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:33:43] 10Operations, 10Puppet, 10Puppet-infrastructure-modernization: Fix unknown variables warning that occur with puppet 4.x - https://phabricator.wikimedia.org/T184186#3875233 (10akosiaris) [14:34:51] RECOVERY - SSH on ms-be1013 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u1 (protocol 2.0) [14:35:22] PROBLEM - very high load 
average likely xfs on ms-be1013 is CRITICAL: CRITICAL - load average: 88.80, 121.97, 129.62 [14:35:48] (03CR) 10Ottomata: [C: 031] "Cool! +1 for this as a no-op :)" [puppet] - 10https://gerrit.wikimedia.org/r/401927 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [14:36:09] (03PS37) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [14:36:17] (03CR) 10Ottomata: [C: 031] profile::hadoop::master: remove the last hadoop cdh auto-lookup [puppet] - 10https://gerrit.wikimedia.org/r/401904 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [14:36:33] akosiaris: should we restart ircecho to have icinga-wm_ without underscore? [14:38:20] the code is a mess btw (not that I did not already know, my hands are in that one as well), I've taken a step back to reevaluate a bit [14:38:29] volans: and done [14:38:36] ok, and thanks! [14:43:23] (03PS2) 10Elukey: profile::hadoop::master: remove the last hadoop cdh auto-lookup [puppet] - 10https://gerrit.wikimedia.org/r/401904 (https://phabricator.wikimedia.org/T167790) [14:44:03] (03CR) 10Elukey: [C: 032] profile::hadoop::master: remove the last hadoop cdh auto-lookup [puppet] - 10https://gerrit.wikimedia.org/r/401904 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [14:44:19] (03PS2) 10Alexandros Kosiaris: Disable docker bridge in production/staging [puppet] - 10https://gerrit.wikimedia.org/r/401929 [14:44:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Disable docker bridge in production/staging [puppet] - 10https://gerrit.wikimedia.org/r/401929 (owner: 10Alexandros Kosiaris) [14:44:35] (03CR) 10Jcrespo: "See parsercachepurging.pp for what I mean about the cron scheduling. Not voting -1 because maybe weekly purges are intended."
[puppet] - 10https://gerrit.wikimedia.org/r/395694 (https://phabricator.wikimedia.org/T181107) (owner: 10Gergő Tisza) [14:44:58] (03PS4) 10Jcrespo: Add ReadingLists tables to Toolforge filter config [puppet] - 10https://gerrit.wikimedia.org/r/397623 (owner: 10Gergő Tisza) [14:45:15] (03PS4) 10Elukey: Refactor thorium's roles in one [puppet] - 10https://gerrit.wikimedia.org/r/401927 (https://phabricator.wikimedia.org/T167790) [14:46:20] (03CR) 10Jcrespo: [C: 032] Add ReadingLists tables to Toolforge filter config [puppet] - 10https://gerrit.wikimedia.org/r/397623 (owner: 10Gergő Tisza) [14:46:51] (03CR) 10Elukey: [C: 032] Refactor thorium's roles in one [puppet] - 10https://gerrit.wikimedia.org/r/401927 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [14:46:56] (03PS5) 10Elukey: Refactor thorium's roles in one [puppet] - 10https://gerrit.wikimedia.org/r/401927 (https://phabricator.wikimedia.org/T167790) [14:47:06] 10Operations, 10Puppet, 10Puppet-infrastructure-modernization: Fix unknown variables warning that occur with puppet 4.x - https://phabricator.wikimedia.org/T184186#3875271 (10Paladox) For 'passwords::gerrit::gerrit_phab_token'. at /etc/puppet/modules/gerrit/manifests/jetty.pp:43:19 Can be fixed anytime but... [14:49:58] (03PS1) 10Jcrespo: mariadb: Move db2046 socket location to /run [puppet] - 10https://gerrit.wikimedia.org/r/402054 (https://phabricator.wikimedia.org/T148507) [14:51:32] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[init_superset] [14:51:41] this is me --^ [14:51:41] fixing [14:51:52] PROBLEM - Host kubernetes1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:52:21] RECOVERY - Host kubernetes1002 is UP: PING WARNING - Packet loss = 61%, RTA = 0.19 ms [14:54:15] !log restart db2046 database to move socket location [14:54:18] (03PS1) 10Filippo Giunchedi: prometheus: add backend varnish mtail job [puppet] - 10https://gerrit.wikimedia.org/r/402055 (https://phabricator.wikimedia.org/T177199) [14:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:59] (03CR) 10Jcrespo: [C: 032] mariadb: Move db2046 socket location to /run [puppet] - 10https://gerrit.wikimedia.org/r/402054 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [14:58:05] 10Operations, 10ops-esams, 10Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063#3875293 (10mark) [15:00:05] PROBLEM - HHVM rendering on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:00:35] PROBLEM - Host kubernetes1004 is DOWN: PING CRITICAL - Packet loss = 100% [15:00:54] RECOVERY - Host kubernetes1004 is UP: PING WARNING - Packet loss = 50%, RTA = 84.17 ms [15:00:55] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 73461 bytes in 0.177 second response time [15:01:34] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:01:55] PROBLEM - Host kubestage1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:02:23] (03PS1) 10Filippo Giunchedi: hieradata: extend eqiad SMART checking deployment [puppet] - 10https://gerrit.wikimedia.org/r/402056 (https://phabricator.wikimedia.org/T86552) [15:02:44] PROBLEM - Host kubestage1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:03:01] !log upgrading HHVM on eqiad image scalers to 3.18.6 [15:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:14] 
RECOVERY - Host kubestage1002 is UP: PING WARNING - Packet loss = 61%, RTA = 0.26 ms [15:03:24] RECOVERY - Host kubestage1001 is UP: PING OK - Packet loss = 0%, RTA = 1.37 ms [15:03:55] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402057 [15:05:51] (03PS2) 10Elukey: profile::hadoop: set hiera defaults to ease labs deployments [puppet] - 10https://gerrit.wikimedia.org/r/402050 (https://phabricator.wikimedia.org/T167790) [15:06:15] (03CR) 10Andrew Bogott: [C: 031] "The lab* hosts in this look fine to me. You're only checking single servers as a proof-of-concept, I take it? Ultimately I'd like all th" [puppet] - 10https://gerrit.wikimedia.org/r/402056 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [15:06:27] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402057 (owner: 10Marostegui) [15:06:35] RECOVERY - very high load average likely xfs on ms-be1013 is OK: OK - load average: 68.53, 70.56, 79.82 [15:07:54] PROBLEM - HHVM rendering on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:55] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402057 (owner: 10Marostegui) [15:08:06] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402057 (owner: 10Marostegui) [15:08:44] RECOVERY - HHVM rendering on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 73453 bytes in 0.260 second response time [15:09:10] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1101:3317 - T163190 (duration: 01m 02s) [15:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:21] T163190: Checksum data on s7 - https://phabricator.wikimedia.org/T163190 [15:10:29] (03CR) 10Elukey: "no op: 
https://puppet-compiler.wmflabs.org/compiler02/9560/" [puppet] - 10https://gerrit.wikimedia.org/r/402050 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [15:11:33] (03CR) 10Filippo Giunchedi: "> The lab* hosts in this look fine to me. You're only checking" [puppet] - 10https://gerrit.wikimedia.org/r/402056 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [15:11:44] PROBLEM - HHVM rendering on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:12:25] PROBLEM - puppet last run on mw1293 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [15:12:34] RECOVERY - HHVM rendering on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 73453 bytes in 0.313 second response time [15:12:54] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [15:13:29] (03CR) 10Andrew Bogott: [C: 031] "> If you have a more representative sample of labvirt hosts you'd like to have covered first let me know!" 
[puppet] - 10https://gerrit.wikimedia.org/r/402056 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [15:15:38] (03PS1) 10Jcrespo: mariadb: Move db2060 socket to /run [puppet] - 10https://gerrit.wikimedia.org/r/402058 (https://phabricator.wikimedia.org/T148507) [15:16:03] (03PS1) 10Volans: Migration to Python 3 [software/cumin] - 10https://gerrit.wikimedia.org/r/402059 [15:16:14] PROBLEM - DPKG on kubernetes1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:16:24] PROBLEM - DPKG on kubernetes1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:16:25] PROBLEM - DPKG on kubernetes2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:16:25] PROBLEM - DPKG on kubernetes2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:16:35] PROBLEM - DPKG on kubernetes1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:16:45] PROBLEM - DPKG on kubernetes2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:16:54] PROBLEM - DPKG on kubernetes2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:16:54] PROBLEM - DPKG on kubernetes1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:17:35] PROBLEM - puppet last run on kubernetes2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[openssh-server],Exec[set debconf flag seen for wireshark-common/install-setuid] [15:17:46] !log upgrade and restart db2060 [15:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:44] did someone do a dist-upgrade on kubernetes* those are stuck in a debconf prompt for openssh [15:19:11] (03CR) 10jerkins-bot: [V: 04-1] Migration to Python 3 [software/cumin] - 10https://gerrit.wikimedia.org/r/402059 (owner: 10Volans) [15:19:15] PROBLEM - puppet last run on kubernetes2002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[openssh-server],Exec[set debconf flag seen for wireshark-common/install-setuid] [15:19:21] needs to be done with -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" [15:19:35] I did [15:19:46] but not a dist-upgrade [15:19:54] simply apt upgrade [15:19:55] was puppet? [15:19:57] ah [15:20:04] anyway fixing [15:20:06] that'll trigger it as well [15:20:32] (03PS1) 10Ema: cache_canary: use main Kafka cluster(s) [puppet] - 10https://gerrit.wikimedia.org/r/402061 [15:21:27] (03PS2) 10Volans: Migration to Python 3 [software/cumin] - 10https://gerrit.wikimedia.org/r/402059 [15:21:31] ideally we'd find a way to fix our puppet sshd config not to interfere with the default config shipped in the package, needs some poking [15:21:45] RECOVERY - DPKG on kubernetes2004 is OK: All packages OK [15:21:54] RECOVERY - DPKG on kubernetes1002 is OK: All packages OK [15:21:54] RECOVERY - DPKG on kubernetes2001 is OK: All packages OK [15:22:16] RECOVERY - DPKG on kubernetes1004 is OK: All packages OK [15:22:24] RECOVERY - DPKG on kubernetes1003 is OK: All packages OK [15:22:24] RECOVERY - DPKG on kubernetes2003 is OK: All packages OK [15:22:25] RECOVERY - DPKG on kubernetes2002 is OK: All packages OK [15:22:25] (03CR) 10Jcrespo: [C: 032] mariadb: Move db2060 socket to /run [puppet] - 10https://gerrit.wikimedia.org/r/402058 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [15:22:35] RECOVERY - puppet last run on kubernetes2001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:22:35] RECOVERY - DPKG on kubernetes1001 is OK: All packages OK [15:23:38] (03CR) 10Ottomata: [V: 032 C: 032] "Ah! great, yes." [puppet] - 10https://gerrit.wikimedia.org/r/402061 (owner: 10Ema) [15:23:42] (03PS2) 10Ottomata: cache_canary: use main Kafka cluster(s) [puppet] - 10https://gerrit.wikimedia.org/r/402061 (owner: 10Ema) [15:23:53] ema: shall I merge and apply that? 
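(Editor's sketch, not part of the channel log.) The dpkg options quoted in the 15:19:21 message are meant to keep an upgrade from stopping at a configuration-file prompt. The Python below only illustrates how such a command line could be assembled. The helper name `build_upgrade_command` is hypothetical, and setting `DEBIAN_FRONTEND=noninteractive` is an added assumption that nobody in the log mentions. It builds the command and prints it, and never runs it.

```python
import os
import shlex


def build_upgrade_command(dist_upgrade=False):
    """Return (argv, env) for an apt-get run that should not stop at conffile prompts.

    Hypothetical helper for illustration; it only builds the command.
    """
    argv = [
        "apt-get",
        "-y",
        # Take the package default where one exists ...
        "-o", "Dpkg::Options::=--force-confdef",
        # ... otherwise keep the locally modified configuration file.
        "-o", "Dpkg::Options::=--force-confold",
        "dist-upgrade" if dist_upgrade else "upgrade",
    ]
    # Assumption: also silence debconf questions via the frontend setting.
    env = dict(os.environ, DEBIAN_FRONTEND="noninteractive")
    return argv, env


if __name__ == "__main__":
    argv, _env = build_upgrade_command()
    # Print the shell-quoted command for inspection instead of executing it.
    print(shlex.join(argv))
```

Whether these flags would have avoided the specific openssh-server prompt seen on the kubernetes hosts is not confirmed by the log.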
[15:24:15] RECOVERY - puppet last run on kubernetes2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:25:13] ottomata: yeah, that patch fixes one of pinkunicorn puppetfails so let's do that :) [15:25:24] thanks! [15:26:41] 10Operations, 10Ops-Access-Requests: Requesting access to Production SSH, statistics-privatedata-users, analytics-privatedata-users for imarlier - https://phabricator.wikimedia.org/T184190#3875340 (10Imarlier) [15:26:46] _joe_: the other puppetfail is likely due to adding ipsec to basic cache roles I think [15:26:57] https://puppet-compiler.wmflabs.org/compiler02/9562/cp1008.wikimedia.org/change.cp1008.wikimedia.org.err [15:27:20] 10Operations, 10Patch-For-Review: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3875354 (10Marostegui) [15:27:25] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3875352 (10Marostegui) 05Open>03Resolved >>! In T180788#3854799, @Marostegui wrote: > This has been all set. > Servers replicate between each other (db1111 being the master). > They con... [15:27:25] moritzm: anyway why upgrading openssh-server generates an SSH2 DSA key ? [15:27:28] Creating SSH2 DSA key; this may take some time ... 
[15:27:28] 1024 SHA256:fcRTTDIGn+z2JzgwZAx5RAuuG19jEK9tH9axQxhMlME root@kubestage1001 (DSA) [15:27:39] I thought it was disabled in 2016 [15:27:49] in fact that's what the changelog says [15:27:49] (03PS1) 10Ottomata: Refine mediawiki job queue events into Hive event database [puppet] - 10https://gerrit.wikimedia.org/r/402064 [15:28:01] _joe_: oh, yes, that's because role::cache::canary includes role::cache::text [15:28:05] (03PS1) 10Jcrespo: mariadb: Move db2067 socket to /run [puppet] - 10https://gerrit.wikimedia.org/r/402065 (https://phabricator.wikimedia.org/T148507) [15:30:01] (03PS2) 10Jcrespo: mariadb: Move db2067 socket to /run [puppet] - 10https://gerrit.wikimedia.org/r/402065 (https://phabricator.wikimedia.org/T148507) [15:30:31] (03PS1) 10Ema: role::cache::text: do not include ipsec role for pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/402067 [15:30:33] akosiaris: the postinst checks whether sshd_config configures "HostKey" and if that's the case generates host keys for all configured variants [15:31:06] (03CR) 10Ottomata: [C: 032] "Looooks good. 
https://puppet-compiler.wmflabs.org/compiler02/9563/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/402064 (owner: 10Ottomata) [15:31:07] to fully disable it from our config we need the ganeti from stretch 9.2 or a backport, see https://phabricator.wikimedia.org/T177371 [15:31:10] (03PS2) 10Ottomata: Refine mediawiki job queue events into Hive event database [puppet] - 10https://gerrit.wikimedia.org/r/402064 [15:31:14] (03PS1) 10Rush: tools: need overlay module for overlay2 for k8s [puppet] - 10https://gerrit.wikimedia.org/r/402068 (https://phabricator.wikimedia.org/T184018) [15:31:20] (03CR) 10Ottomata: [V: 032 C: 032] Refine mediawiki job queue events into Hive event database [puppet] - 10https://gerrit.wikimedia.org/r/402064 (owner: 10Ottomata) [15:31:31] (03PS1) 10Filippo Giunchedi: cassandra: use prometheus-jmx-exporter Debian package [puppet] - 10https://gerrit.wikimedia.org/r/402069 (https://phabricator.wikimedia.org/T181728) [15:31:33] (03PS1) 10Filippo Giunchedi: cassandra: switch to using jmx-exporter jar from Debian package [puppet] - 10https://gerrit.wikimedia.org/r/402070 (https://phabricator.wikimedia.org/T181728) [15:31:36] !log demon@tin Synchronized php-1.31.0-wmf.15/extensions/ActiveAbstract/: unbreak, T184177 (duration: 01m 02s) [15:31:44] apergos: ^^^ [15:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:45] T184177: Abstract dumps broken by MW deploy - https://phabricator.wikimedia.org/T184177 [15:31:48] thank you [15:31:51] (03CR) 10jerkins-bot: [V: 04-1] tools: need overlay module for overlay2 for k8s [puppet] - 10https://gerrit.wikimedia.org/r/402068 (https://phabricator.wikimedia.org/T184018) (owner: 10Rush) [15:31:52] yw [15:32:00] let me have a run of that command on my test host again [15:32:04] moritzm: ah nice thanks [15:32:25] 10Operations, 10Ops-Access-Requests: Requesting access to Production SSH, statistics-privatedata-users, analytics-privatedata-users, perf-team for 
imarlier - https://phabricator.wikimedia.org/T184190#3875392 (10Imarlier) [15:32:53] (03PS2) 10Filippo Giunchedi: hieradata: extend eqiad SMART checking deployment [puppet] - 10https://gerrit.wikimedia.org/r/402056 (https://phabricator.wikimedia.org/T86552) [15:33:02] PROBLEM - Check size of conntrack table on mw1335 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [15:33:36] moritzm: so what basically happened in this case is that I upgraded for the first time ever the openssh-server package since those hosts were installed (and the sshd_config we ship was applied afterwards). Ok makes sense [15:33:37] thanks! [15:34:47] (03PS2) 10Ema: role::cache::text: do not include ipsec role for pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/402067 [15:34:57] (03CR) 10Ema: [V: 032 C: 032] role::cache::text: do not include ipsec role for pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/402067 (owner: 10Ema) [15:35:02] RECOVERY - Check size of conntrack table on mw1335 is OK: OK: nf_conntrack is 72 % full [15:35:12] (03CR) 10Filippo Giunchedi: "> > If you have a more representative sample of labvirt hosts you'd" [puppet] - 10https://gerrit.wikimedia.org/r/402056 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [15:35:15] (03PS2) 10Rush: tools: need overlay module for overlay2 for k8s [puppet] - 10https://gerrit.wikimedia.org/r/402068 (https://phabricator.wikimedia.org/T184018) [15:35:28] (03PS3) 10Rush: tools: need overlay module for overlay2 for k8s [puppet] - 10https://gerrit.wikimedia.org/r/402068 (https://phabricator.wikimedia.org/T184018) [15:36:15] (03CR) 10Jcrespo: [C: 032] mariadb: Move db2067 socket to /run [puppet] - 10https://gerrit.wikimedia.org/r/402065 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [15:36:21] (03PS3) 10Jcrespo: mariadb: Move db2067 socket to /run [puppet] - 10https://gerrit.wikimedia.org/r/402065 (https://phabricator.wikimedia.org/T148507) [15:36:34] !log upgrade and restart db2067 [15:36:37] 
(03CR) 10jerkins-bot: [V: 04-1] tools: need overlay module for overlay2 for k8s [puppet] - 10https://gerrit.wikimedia.org/r/402068 (https://phabricator.wikimedia.org/T184018) (owner: 10Rush) [15:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:15] looks good, thanks much [15:37:22] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:37:49] _joe_: fixed with https://gerrit.wikimedia.org/r/402067 FYI [15:37:52] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:40:45] (03PS3) 10Elukey: profile::hadoop: set hiera defaults to ease labs deployments [puppet] - 10https://gerrit.wikimedia.org/r/402050 (https://phabricator.wikimedia.org/T167790) [15:41:12] PROBLEM - HHVM rendering on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:41:22] PROBLEM - HHVM rendering on mw1298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:41:57] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler03/9567/" [puppet] - 10https://gerrit.wikimedia.org/r/402070 (https://phabricator.wikimedia.org/T181728) (owner: 10Filippo Giunchedi) [15:42:02] RECOVERY - HHVM rendering on mw1297 is OK: HTTP OK: HTTP/1.1 200 OK - 73453 bytes in 0.274 second response time [15:42:13] RECOVERY - HHVM rendering on mw1298 is OK: HTTP OK: HTTP/1.1 200 OK - 73455 bytes in 1.996 second response time [15:42:15] (03PS1) 10Ottomata: Use intermediate script for json refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/402072 [15:42:21] (03PS2) 10Ottomata: Use intermediate script for json refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/402072 [15:42:52] PROBLEM - Host kubernetes1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:43:02] RECOVERY - Host kubernetes1002 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [15:44:12] !log upgrade and restart db2076 [15:44:24] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:22] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [15:47:02] PROBLEM - Host kubernetes1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:22] RECOVERY - Host kubernetes1002 is UP: PING WARNING - Packet loss = 93%, RTA = 192.30 ms [15:48:32] (03PS4) 10Elukey: profile::hadoop: set hiera defaults to ease labs deployments [puppet] - 10https://gerrit.wikimedia.org/r/402050 (https://phabricator.wikimedia.org/T167790) [15:49:07] (03CR) 10jerkins-bot: [V: 04-1] profile::hadoop: set hiera defaults to ease labs deployments [puppet] - 10https://gerrit.wikimedia.org/r/402050 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [15:49:08] 10Operations, 10Ops-Access-Requests: Requesting access to Production SSH, statistics-privatedata-users, analytics-privatedata-users, perf-team for imarlier - https://phabricator.wikimedia.org/T184190#3875434 (10Nuria) Approved [15:49:27] (03PS3) 10Ottomata: Use intermediate script for json refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/402072 [15:49:32] PROBLEM - Check systemd state on kubernetes1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:52:07] (03PS5) 10Elukey: profile::hadoop: set hiera defaults to ease labs deployments [puppet] - 10https://gerrit.wikimedia.org/r/402050 (https://phabricator.wikimedia.org/T167790) [15:53:26] (03PS4) 10Andrew Bogott: tools: need overlay module for overlay2 for k8s [puppet] - 10https://gerrit.wikimedia.org/r/402068 (https://phabricator.wikimedia.org/T184018) (owner: 10Rush) [15:53:51] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler03/9571/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/402072 (owner: 10Ottomata) [15:53:59] 10Operations, 10Data-Services, 10MediaWiki-Maintenance-scripts, 10Wikidata, 10Patch-For-Review: Missing references to s8 on maintenance and cloud scripts (and potentially others) - https://phabricator.wikimedia.org/T184179#3875081 (10chasemp) Just an extra ping for @bd808 as he wrote most of what I think... [15:55:36] (03CR) 10Elukey: [C: 032] "a wonderful no-op https://puppet-compiler.wmflabs.org/compiler03/9572/" [puppet] - 10https://gerrit.wikimedia.org/r/402050 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [15:55:43] (03PS6) 10Elukey: profile::hadoop: set hiera defaults to ease labs deployments [puppet] - 10https://gerrit.wikimedia.org/r/402050 (https://phabricator.wikimedia.org/T167790) [16:03:37] RECOVERY - Check systemd state on kubernetes1002 is OK: OK - unknown: The operational state could not be determined, due to lack of resources or another error cause. 
[16:04:07] all the kubernetes hosts rebooting is me (in case icinga-wm is fast enough) [16:04:37] PROBLEM - Host kubernetes2003 is DOWN: PING CRITICAL - Packet loss = 100% [16:04:47] PROBLEM - Host kubernetes1004 is DOWN: PING CRITICAL - Packet loss = 100% [16:04:57] PROBLEM - Host kubernetes2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:04:58] PROBLEM - Host kubernetes1003 is DOWN: PING CRITICAL - Packet loss = 100% [16:04:58] PROBLEM - Host kubestage1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:05:08] icinga-wm is faster than you think [16:05:17] PROBLEM - Host kubestage1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:05:17] PROBLEM - Host kubernetes2004 is DOWN: PING CRITICAL - Packet loss = 100% [16:05:17] RECOVERY - Host kubernetes1003 is UP: PING OK - Packet loss = 0%, RTA = 14.23 ms [16:05:17] PROBLEM - Host kubernetes2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:05:28] RECOVERY - Host kubernetes1004 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [16:05:37] RECOVERY - Host kubernetes2004 is UP: PING WARNING - Packet loss = 61%, RTA = 57.96 ms [16:05:37] RECOVERY - Host kubernetes2003 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [16:05:37] RECOVERY - Host kubernetes2001 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [16:05:47] RECOVERY - Host kubestage1001 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [16:05:49] RECOVERY - Host kubernetes2002 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [16:05:49] RECOVERY - Host kubestage1002 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [16:06:06] it may have something to do with python threads starting python threads starting forever loops [16:06:08] :P [16:06:24] * akosiaris has a headache once more by reading ircecho.py [16:07:19] akosiaris: one time I fixed a bug in it and that resulted in revealing like 5 more scoping and shadow var name issues and I have fled from it ever since [16:08:30] chasemp: I can feel you. I've done the exact same thing. 
I've even started reading python-irclib code. That's when I was terrified and decided I don't want much to do with it. And it only returned like some bad movie with a vengeance [16:09:55] is ircecho.py running on kubernetes yet :D [16:10:09] sure [16:10:36] for like -2147483648 days already [16:10:41] oh wait.... [16:10:48] should replace our IRC bots with a proper bot that consumes from kafka [16:11:00] and a bunch of kafka producers emitting notable events :) [16:11:33] paravoid is volunteering to rewrite ircecho as kafkaecho!!! [16:11:36] <:o) [16:11:40] I just might :) [16:11:44] paravoid has a team now [16:11:48] he already sorta has! [16:12:08] nah, that was an IRC server! [16:12:13] OHHH [16:12:15] the IRC bot [16:12:16] haha [16:12:17] yes! [16:12:18] YES! [16:12:24] bring it on IIIIN https://wikitech.wikimedia.org/wiki/User:Ottomata/Stream_Data_Platform#Stream_Data_Platform [16:12:54] I proposed running fedmsg in the past [16:13:00] ( http://www.fedmsg.com/en/stable/ ) [16:13:12] but nowadays should probably just leverage kafka instead [16:14:58] !log upgrade and restart db2087 (s6/s7) [16:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:17] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:18:36] paravoid: have you seen https://github.com/confluentinc/ksql#use-cases-and-examples ? [16:19:23] ksql you mean, or one of those specific examples? [16:19:37] ksql [16:20:01] --- Log opened Mon Aug 28 23:10:03 2017 [16:20:01] 23:10 did you see https://www.confluent.io/blog/ksql-open-source-streaming-sql-for-apache-kafka/ ? [16:20:04] 23:11 whoooa no i didn't [16:20:08] :P [16:20:11] HAHAHHA [16:20:24] i'm sure i'll ask you again in a few months too [16:20:29] :D [16:21:02] i finally got their more recent code to build and run pointed at kafka-jumbo [16:21:05] still need to play some more [16:21:10] cool! 
[16:22:04] (03PS5) 10Andrew Bogott: tools: need overlay module for overlay2 for k8s [puppet] - 10https://gerrit.wikimedia.org/r/402068 (https://phabricator.wikimedia.org/T184018) (owner: 10Rush) [16:22:06] (03PS1) 10Andrew Bogott: kmod blacklist: allow ensure => absent for a given blacklist [puppet] - 10https://gerrit.wikimedia.org/r/402075 (https://phabricator.wikimedia.org/T184018) [16:26:20] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: SCAP: Upload debian package version 3.7.4-3 - https://phabricator.wikimedia.org/T182347#3875582 (10thcipriani) [16:26:22] 10Operations, 10Scap, 10Patch-For-Review: scap 3.7.4-2 is broken - https://phabricator.wikimedia.org/T183046#3875579 (10thcipriani) 05Open>03Resolved a:03akosiaris This one was resolved with the release of scap 3.7.4-3 [16:29:36] 10Operations, 10ops-eqiad, 10DC-Ops: cp1066's DRAC not responding to SSH - https://phabricator.wikimedia.org/T184196#3875589 (10ema) [16:29:50] 10Operations, 10ops-eqiad, 10DC-Ops: cp1066's DRAC not responding to SSH - https://phabricator.wikimedia.org/T184196#3875603 (10ema) p:05Triage>03Normal [16:30:06] 10Operations, 10ops-eqiad, 10DC-Ops: cp1066's DRAC not responding to SSH - https://phabricator.wikimedia.org/T184196#3875589 (10ema) [16:31:12] (03CR) 10Andrew Bogott: "Compiler output (effective no-op) here: https://puppet-compiler.wmflabs.org/compiler02/9575/puppetmaster1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/402068 (https://phabricator.wikimedia.org/T184018) (owner: 10Rush) [16:32:42] (03CR) 10Alexandros Kosiaris: [C: 031] hieradata: extend eqiad SMART checking deployment [puppet] - 10https://gerrit.wikimedia.org/r/402056 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [16:37:09] (03PS3) 10Filippo Giunchedi: hieradata: extend eqiad SMART checking deployment [puppet] - 10https://gerrit.wikimedia.org/r/402056 (https://phabricator.wikimedia.org/T86552) [16:37:30] (03CR) 10Rush: tools: need overlay module for overlay2 for k8s 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/402068 (https://phabricator.wikimedia.org/T184018) (owner: 10Rush) [16:38:45] !log upgrade and restart db2089 (s5/s6) [16:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:24] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: extend eqiad SMART checking deployment [puppet] - 10https://gerrit.wikimedia.org/r/402056 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [16:46:12] 10Operations, 10ops-codfw, 10Cloud-VPS: Connect labtestvirt2003 eth1 and eth2 interface(s) to switch fabric - https://phabricator.wikimedia.org/T183167#3875655 (10Papaul) @RobH Row B rack B1 labtestvirt2002 ge-1/0/12 [16:46:55] 10Operations, 10ops-codfw, 10Cloud-VPS: Connect labtestvirt2003 eth1 and eth2 interface(s) to switch fabric - https://phabricator.wikimedia.org/T183167#3875657 (10Papaul) a:05Papaul>03RobH [16:50:26] (03CR) 10Rush: [C: 031] "makes sense to me" [puppet] - 10https://gerrit.wikimedia.org/r/402068 (https://phabricator.wikimedia.org/T184018) (owner: 10Rush) [16:50:49] (03CR) 10Rush: [C: 031] "Have to manage state here for good or ill to leave existing bans or future removed bans in a predictive state" [puppet] - 10https://gerrit.wikimedia.org/r/402075 (https://phabricator.wikimedia.org/T184018) (owner: 10Andrew Bogott) [16:52:03] (03PS2) 10Andrew Bogott: kmod blacklist: allow ensure => absent for a given blacklist [puppet] - 10https://gerrit.wikimedia.org/r/402075 (https://phabricator.wikimedia.org/T184018) [16:52:05] (03PS1) 10Herron: mx: add civicrm.wikimedia.org to donate_domains [puppet] - 10https://gerrit.wikimedia.org/r/402078 (https://phabricator.wikimedia.org/T184120) [16:53:52] (03CR) 10Andrew Bogott: [C: 032] kmod blacklist: allow ensure => absent for a given blacklist [puppet] - 10https://gerrit.wikimedia.org/r/402075 (https://phabricator.wikimedia.org/T184018) (owner: 10Andrew Bogott) [16:54:38] 10Operations, 10Discovery-Search (Current work), 
10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3875676 (10Gehel) A .deb of prometheus jmx_exporter is now available. I started to experiment on `deployment-elastic06`. Elasticsea... [16:54:45] (03CR) 10Herron: [C: 032] mx: add civicrm.wikimedia.org to donate_domains [puppet] - 10https://gerrit.wikimedia.org/r/402078 (https://phabricator.wikimedia.org/T184120) (owner: 10Herron) [16:54:54] (03PS2) 10Herron: mx: add civicrm.wikimedia.org to donate_domains [puppet] - 10https://gerrit.wikimedia.org/r/402078 (https://phabricator.wikimedia.org/T184120) [16:56:27] 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3875680 (10chasemp) [17:00:05] godog, moritzm, and _joe_: #bothumor My software never has bugs. It just develops random features. Rise for Puppet SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180104T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:00:49] ah, i'll take a look as well [17:01:11] nothing in it.. 
ok [17:01:20] because already merged :) [17:02:28] \o/ [17:03:58] (03CR) 10Eevans: [C: 031] "LGTM (when the timing is appropriate)" [puppet] - 10https://gerrit.wikimedia.org/r/401784 (https://phabricator.wikimedia.org/T184110) (owner: 10Mobrovac) [17:06:48] (03PS6) 10Andrew Bogott: tools: need overlay module for overlay2 for k8s [puppet] - 10https://gerrit.wikimedia.org/r/402068 (https://phabricator.wikimedia.org/T184018) (owner: 10Rush) [17:10:07] (03CR) 10jerkins-bot: [V: 04-1] tools: need overlay module for overlay2 for k8s [puppet] - 10https://gerrit.wikimedia.org/r/402068 (https://phabricator.wikimedia.org/T184018) (owner: 10Rush) [17:11:16] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/402068 (https://phabricator.wikimedia.org/T184018) (owner: 10Rush) [17:11:54] (03PS2) 10Herron: add mx records for civicrm.wikimedia.org pointing to production mx's [dns] - 10https://gerrit.wikimedia.org/r/401604 (https://phabricator.wikimedia.org/T184120) (owner: 10Jgreen) [17:12:05] (03PS2) 10Dzahn: apache: add httpd module as a replacement [puppet] - 10https://gerrit.wikimedia.org/r/400100 (owner: 10Giuseppe Lavagetto) [17:12:28] (03CR) 10Andrew Bogott: [C: 032] tools: need overlay module for overlay2 for k8s [puppet] - 10https://gerrit.wikimedia.org/r/402068 (https://phabricator.wikimedia.org/T184018) (owner: 10Rush) [17:14:07] (03CR) 10Herron: [C: 032] add mx records for civicrm.wikimedia.org pointing to production mx's [dns] - 10https://gerrit.wikimedia.org/r/401604 (https://phabricator.wikimedia.org/T184120) (owner: 10Jgreen) [17:14:14] (03PS3) 10Herron: add mx records for civicrm.wikimedia.org pointing to production mx's [dns] - 10https://gerrit.wikimedia.org/r/401604 (https://phabricator.wikimedia.org/T184120) (owner: 10Jgreen) [17:16:35] (03CR) 10Ema: [C: 031] prometheus: add backend varnish mtail job [puppet] - 10https://gerrit.wikimedia.org/r/402055 (https://phabricator.wikimedia.org/T177199) (owner: 10Filippo Giunchedi) [17:17:33] 
PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:19:12] (03PS2) 10Filippo Giunchedi: prometheus: add backend varnish mtail job [puppet] - 10https://gerrit.wikimedia.org/r/402055 (https://phabricator.wikimedia.org/T177199) [17:19:25] (03CR) 10Chad: "This has been running beta, should we land it?" [puppet] - 10https://gerrit.wikimedia.org/r/386869 (https://phabricator.wikimedia.org/T179371) (owner: 10Filippo Giunchedi) [17:20:00] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add backend varnish mtail job [puppet] - 10https://gerrit.wikimedia.org/r/402055 (https://phabricator.wikimedia.org/T177199) (owner: 10Filippo Giunchedi) [17:20:33] (03PS1) 10Alexandros Kosiaris: ircecho: Remove redundant thread [puppet] - 10https://gerrit.wikimedia.org/r/402081 [17:24:37] (03PS2) 10Alexandros Kosiaris: ircecho: Remove redundant thread [puppet] - 10https://gerrit.wikimedia.org/r/402081 [17:26:31] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/402069 (https://phabricator.wikimedia.org/T181728) (owner: 10Filippo Giunchedi) [17:26:44] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/402070 (https://phabricator.wikimedia.org/T181728) (owner: 10Filippo Giunchedi) [17:26:54] (03PS1) 10Elukey: role::analytics_cluster: avoid to expicitly instance the standard class [puppet] - 10https://gerrit.wikimedia.org/r/402084 (https://phabricator.wikimedia.org/T167790) [17:27:04] (03PS1) 10Herron: exim: add civicrm.wikimedia.org to wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/402085 (https://phabricator.wikimedia.org/T184120) [17:27:53] (03CR) 10Herron: [C: 032] exim: add civicrm.wikimedia.org to wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/402085 (https://phabricator.wikimedia.org/T184120) (owner: 10Herron) [17:28:03] (03PS1) 10BryanDavis: wikireplica_dns: Add s8 shard [puppet] - 10https://gerrit.wikimedia.org/r/402086 
(https://phabricator.wikimedia.org/T184179) [17:28:05] (03PS1) 10BryanDavis: wmcs: Add s8 to maintain-meta_p [puppet] - 10https://gerrit.wikimedia.org/r/402087 (https://phabricator.wikimedia.org/T184179) [17:32:01] (03CR) 10Andrew Bogott: [C: 032] wmcs: Add s8 to maintain-meta_p [puppet] - 10https://gerrit.wikimedia.org/r/402087 (https://phabricator.wikimedia.org/T184179) (owner: 10BryanDavis) [17:32:40] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9577/ - noop" [puppet] - 10https://gerrit.wikimedia.org/r/402084 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [17:32:46] (03PS2) 10Elukey: role::analytics_cluster: avoid to expicitly instance the standard class [puppet] - 10https://gerrit.wikimedia.org/r/402084 (https://phabricator.wikimedia.org/T167790) [17:33:03] (03CR) 10Andrew Bogott: [C: 032] wikireplica_dns: Add s8 shard [puppet] - 10https://gerrit.wikimedia.org/r/402086 (https://phabricator.wikimedia.org/T184179) (owner: 10BryanDavis) [17:33:15] (03CR) 10Dzahn: [C: 031] "the type aliases added into wmflib seem a little unrelated to the httpd module itself, maybe they should be a separate change" [puppet] - 10https://gerrit.wikimedia.org/r/400100 (owner: 10Giuseppe Lavagetto) [17:34:25] 10Operations, 10ops-eqiad: mw1191 ipmi-sel cpu errors - https://phabricator.wikimedia.org/T179640#3731680 (10RobH) This host now also has a failed sector check: ``` This message was generated by the smartd daemon running on: host name: mw1191 DNS domain: eqiad.wmnet The following warning/error was l... 
[17:34:44] (03PS2) 10Andrew Bogott: wikireplica_dns: Add s8 shard [puppet] - 10https://gerrit.wikimedia.org/r/402086 (https://phabricator.wikimedia.org/T184179) (owner: 10BryanDavis) [17:35:14] (03PS2) 10Andrew Bogott: wmcs: Add s8 to maintain-meta_p [puppet] - 10https://gerrit.wikimedia.org/r/402087 (https://phabricator.wikimedia.org/T184179) (owner: 10BryanDavis) [17:39:48] (03CR) 10Chad: [V: 032 C: 032] Add hooks plugin @ 2.13.9 [software/gerrit] - 10https://gerrit.wikimedia.org/r/401697 (https://phabricator.wikimedia.org/T183792) (owner: 10Chad) [17:40:25] !log demon@tin Started deploy [gerrit/gerrit@1e1a79d]: deploying hooks plugin [17:40:35] !log demon@tin Finished deploy [gerrit/gerrit@1e1a79d]: deploying hooks plugin (duration: 00m 10s) [17:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:10] paladox: ^^ [17:42:02] Oh, also I'm wondering if we can come up with /some/ sort of Jenkins job to validate us before merging? Like...I feel gross just +2 / +2 / Submit myself.... [17:42:04] PROBLEM - Check systemd state on pc2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:42:04] PROBLEM - Check whether ferm is active by checking the default input chain on pc2005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [17:42:36] !log upgrading HHVM on eqiad video scalers to 3.18.6 [17:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:34] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:52:13] RECOVERY - Check whether ferm is active by checking the default input chain on pc2005 is OK: OK ferm input default policy is set [17:53:04] RECOVERY - Check systemd state on pc2005 is OK: OK - running: The system is fully operational [17:55:06] (03PS3) 10Chad: Beta: Moving all docroots to standard-docroot [puppet] - 10https://gerrit.wikimedia.org/r/394203 [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180104T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:05:04] (03PS4) 10Chad: Beta: Moving all docroots to standard-docroot [puppet] - 10https://gerrit.wikimedia.org/r/394203 (https://phabricator.wikimedia.org/T126306) [18:05:06] (03PS1) 10Chad: Moving all docroots to standard-docroot [puppet] - 10https://gerrit.wikimedia.org/r/402090 (https://phabricator.wikimedia.org/T126306) [18:05:10] (03PS1) 10Chad: Drop unused docroots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402091 (https://phabricator.wikimedia.org/T126306) [18:20:22] 10Operations, 10ops-codfw, 10Cloud-VPS: Connect labtestvirt2003 eth1 and eth2 interface(s) to switch fabric - https://phabricator.wikimedia.org/T183167#3876000 (10RobH) synced up with @papaul via irc: labtestvirt2003:eth1:ge-1/0/12 labtestvirt2003:eth2:ge-1/0/14 [18:25:59] hey - I'm trying to connect to stat1006 but can't (I haven't before either) [18:26:12] I can connect to tin and bast1001 fine though [18:27:23] !log upgrade and restart labsdb1009 [18:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:19] haproxy will complain in a second while I reboot [18:31:45] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [18:34:36] edsanders: are you in the analytics-privatedata-users group? [18:35:02] * bd808 grumbles about that list not being in alphabetical order [18:35:38] edsanders: you aren't. So you need to be added to that group [18:35:54] bd808: well that would explain it [18:35:56] (03PS1) 10Gehel: elasticsearch / prometheus: enable prometheus jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/402095 (https://phabricator.wikimedia.org/T181627) [18:36:03] how do I go about doing that? 
[18:36:28] * bd808 is looking for an example ticket [18:36:29] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch / prometheus: enable prometheus jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/402095 (https://phabricator.wikimedia.org/T181627) (owner: 10Gehel) [18:36:59] edsanders: make a ticket like T115548 [18:36:59] T115548: Requesting access to analytics-privatedata-users for Bryan Davis - https://phabricator.wikimedia.org/T115548 [18:37:24] and we should be back up [18:37:34] (03CR) 10Gehel: "I'm not entirely sure about the organization of the different files / classes. It is not entirely clear to me what should be in the elasti" [puppet] - 10https://gerrit.wikimedia.org/r/402095 (https://phabricator.wikimedia.org/T181627) (owner: 10Gehel) [18:37:45] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 [18:39:02] (03PS2) 10Gehel: elasticsearch / prometheus: enable prometheus jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/402095 (https://phabricator.wikimedia.org/T181627) [18:40:03] thanks [18:43:34] 10Operations, 10Ops-Access-Requests: Requesting access to analytics-privatedata-users for Ed Sanders - https://phabricator.wikimedia.org/T184206#3876092 (10Esanders) [18:46:58] (03CR) 10Gehel: "Puppet compiler looks happy: https://puppet-compiler.wmflabs.org/compiler02/9578/" [puppet] - 10https://gerrit.wikimedia.org/r/402095 (https://phabricator.wikimedia.org/T181627) (owner: 10Gehel) [18:49:27] 10Operations, 10Cloud-VPS, 10cloud-services-team: 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#3876110 (10madhuvishy) [18:49:31] 10Operations, 10Cloud-VPS, 10cloud-services-team: templatetiger is using 827G of 8T available tools nfs storage - https://phabricator.wikimedia.org/T183954#3876108 (10madhuvishy) 05Open>03Resolved Thank you! 
[18:52:14] !log bsitzmann@tin Started deploy [mobileapps/deploy@8bcffa9]: Update mobileapps to a4ba9fd (T182330 T177430 T170690 T182652 T184198) [18:52:15] 10Operations, 10Cloud-VPS, 10cloud-services-team: wikidumpparse is using 1.2TB of 5T available NFS misc storage - https://phabricator.wikimedia.org/T183970#3876124 (10madhuvishy) @notconfusing @Dfko @Hargup Hello! Poke on this task again, could you please clean up the home folder soon, thank you. [18:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:30] T182652: Citations field becomes type with just a single value - https://phabricator.wikimedia.org/T182652 [18:52:30] T184198: French news has empty news story today - https://phabricator.wikimedia.org/T184198 [18:52:30] T182330: Media: handle galleries - https://phabricator.wikimedia.org/T182330 [18:52:30] T170690: Extract a References JSON API - https://phabricator.wikimedia.org/T170690 [18:52:31] T177430: Develop a Media JSON API - https://phabricator.wikimedia.org/T177430 [18:53:07] 10Operations, 10Cloud-VPS, 10cloud-services-team: tools.iabot is using 1.3T of 8T available tools nfs storage - https://phabricator.wikimedia.org/T183953#3876159 (10madhuvishy) @Cyberpower678 Any update on this? Thanks! [18:53:49] 10Operations, 10Developer-Relations, 10cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org (pilot instance) - https://phabricator.wikimedia.org/T180854#3876160 (10Tgr) >>! In T180854#3874954, @Qgil wrote: > @Andrew @Austin @EBernhardson @Tgr @Samwilson @yuvipanda, as current admins of [[ htt... 
[18:58:14] !log bsitzmann@tin Finished deploy [mobileapps/deploy@8bcffa9]: Update mobileapps to a4ba9fd (T182330 T177430 T170690 T182652 T184198) (duration: 06m 01s) [18:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:28] T182652: Citations field becomes type with just a single value - https://phabricator.wikimedia.org/T182652 [18:58:34] T184198: French news has empty news story today - https://phabricator.wikimedia.org/T184198 [18:58:34] T182330: Media: handle galleries - https://phabricator.wikimedia.org/T182330 [18:58:34] T170690: Extract a References JSON API - https://phabricator.wikimedia.org/T170690 [18:58:34] T177430: Develop a Media JSON API - https://phabricator.wikimedia.org/T177430 [18:59:28] bd808: Do I need to do anything else? (https://phabricator.wikimedia.org/T184206) [19:00:05] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180104T1900). [19:00:05] No GERRIT patches in the queue for this window AFAICS. [19:02:26] edsanders: nope. I mean you could gently nudge some roots to review it and get the wait period started I guess [19:05:47] PROBLEM - Check size of conntrack table on mw1335 is CRITICAL: CRITICAL: nf_conntrack is 91 % full [19:08:25] bd808: Is manager approval still required? [19:08:31] (I always thought that was a somewhat silly part of the process) [19:08:46] RoanKattouw: oh probably, and yes silly [19:09:05] although I guess expecting techops to know everyone and what they do is also silly [19:09:08] 10Operations, 10Cloud-Services, 10netops: Consider renumbering Labs to separate address spaces - https://phabricator.wikimedia.org/T122406#1903452 (10chasemp) > The 172.16.0.0/12 space (still RFC 1918) for private addresses, instead of 10/8. 
Right now our allocation for instances in the main deployment in... [19:10:01] That's true [19:12:56] RECOVERY - Check size of conntrack table on mw1335 is OK: OK: nf_conntrack is 74 % full [19:13:00] (03PS2) 10Gehel: elasticsearch: auto reload log4j2 configuration [puppet] - 10https://gerrit.wikimedia.org/r/388130 [19:13:11] I can attest to rarely knowing who anyone is or what they do :) [19:13:52] (03CR) 10Gehel: [C: 032] elasticsearch: auto reload log4j2 configuration [puppet] - 10https://gerrit.wikimedia.org/r/388130 (owner: 10Gehel) [19:14:13] it would help to have an orgchart where you can check [19:14:56] Namely has an org chart, kind of [19:15:02] It does reliably know who someone's manager is [19:15:24] (03PS3) 10Alexandros Kosiaris: ircecho: Remove redundant thread [puppet] - 10https://gerrit.wikimedia.org/r/402081 [19:15:24] (Also the staff page can tell you which team someone's on, and what their title is) [19:15:26] (03PS1) 10Alexandros Kosiaris: ircecho: Force unbuffered stdin/stdout/stderr [puppet] - 10https://gerrit.wikimedia.org/r/402101 [19:15:28] ah yea, that's good [19:20:20] this doesn't eliminate the need for approval but it certainly makes it easier to find out who the manager is [19:21:18] there is an undocumented wish to build a self-service portal app that would make all of this easier. someday p.aravoid or I will find developer time to work on that :) [19:22:17] You know, we did that once [19:22:29] Erik Möller commissioned marktraceur to write one once [19:22:49] RIP [19:24:25] 10Operations, 10Data-Services, 10MediaWiki-Maintenance-scripts, 10Wikidata, 10Patch-For-Review: Missing references to s8 on maintenance and cloud scripts (and potentially others) - https://phabricator.wikimedia.org/T184179#3876321 (10bd808) I think my patches above take care of the Cloud Services/Data Se... 
[19:25:37] PROBLEM - HP RAID on db2054 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:4 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [19:25:39] ACKNOWLEDGEMENT - HP RAID on db2054 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:4 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T184210 [19:25:42] 10Operations, 10ops-codfw: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T184210#3876324 (10ops-monitoring-bot) [19:26:10] 10Operations, 10ops-codfw, 10DBA: db2054: Disk with predictive failure - https://phabricator.wikimedia.org/T183887#3876328 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete [19:31:51] 10Operations, 10Ops-Access-Requests: Requesting access to Production SSH, statistics-privatedata-users, analytics-privatedata-users, perf-team for imarlier - https://phabricator.wikimedia.org/T184190#3876335 (10RobH) p:05Triage>03Normal [19:32:13] 10Operations, 10Ops-Access-Requests: Requesting access to Production SSH, statistics-privatedata-users, analytics-privatedata-users, perf-team for imarlier - https://phabricator.wikimedia.org/T184190#3875340 (10RobH) [19:34:18] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T184210#3876339 (10Peachey88) [19:37:25] no_justification :). 
[19:38:25] (03PS1) 10RobH: adding shell user imarlier [puppet] - 10https://gerrit.wikimedia.org/r/402102 (https://phabricator.wikimedia.org/T184190) [19:41:35] (03PS1) 10RobH: adding imarlier to groups [puppet] - 10https://gerrit.wikimedia.org/r/402103 [19:41:41] Hi ops-team - Little ping about me deploying analytics-refinery (analytics only stuff) [19:42:07] (03CR) 10jerkins-bot: [V: 04-1] adding imarlier to groups [puppet] - 10https://gerrit.wikimedia.org/r/402103 (owner: 10RobH) [19:42:17] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production SSH, statistics-privatedata-users, analytics-privatedata-users, perf-team for imarlier - https://phabricator.wikimedia.org/T184190#3876351 (10RobH) [19:42:27] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T184210#3876353 (10Marostegui) 05Open>03declined Thanks - this is because we replaced a disk which was on predicted failure: T183887 [19:43:29] (03PS2) 10RobH: adding imarlier to groups [puppet] - 10https://gerrit.wikimedia.org/r/402103 (https://phabricator.wikimedia.org/T184190) [19:43:35] ok, the space AFTER bug: has changed; i could swear it was required to NOT have a space not that long ago... [19:43:53] i have more failed commits in the past 6 months for commit messages...... [19:44:35] oh well. [19:46:21] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production SSH, statistics-privatedata-users, analytics-privatedata-users, perf-team for imarlier - https://phabricator.wikimedia.org/T184190#3875340 (10RobH) I've updated the task description with the checklist of required items.... 
[19:46:51] !log joal@tin Started deploy [analytics/refinery@a69a2cd]: Regular analytics deploy [19:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:58] (03PS1) 10Jgreen: A/PTR for civicrm-eqiad.wm.o at 208.80.152.232, remove deprecated frdev hostnames [dns] - 10https://gerrit.wikimedia.org/r/402105 [19:49:35] (03CR) 10Jgreen: [C: 032] A/PTR for civicrm-eqiad.wm.o at 208.80.152.232, remove deprecated frdev hostnames [dns] - 10https://gerrit.wikimedia.org/r/402105 (owner: 10Jgreen) [19:51:29] !log joal@tin Finished deploy [analytics/refinery@a69a2cd]: Regular analytics deploy (duration: 04m 38s) [19:51:32] 10Operations, 10Ops-Access-Requests: Requesting access to analytics-privatedata-users for Ed Sanders - https://phabricator.wikimedia.org/T184206#3876366 (10RobH) [19:51:39] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production SSH, statistics-privatedata-users, analytics-privatedata-users, perf-team for imarlier - https://phabricator.wikimedia.org/T184190#3876367 (10Imarlier) Awesome -- thanks, Rob! [19:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:51] 10Operations, 10Ops-Access-Requests: Requesting access to analytics-privatedata-users for Ed Sanders - https://phabricator.wikimedia.org/T184206#3876092 (10RobH) @esanders: I didn't see your signature on the L3 document. This is typically required for shell access. If your access precedes the document usage... [19:56:05] 10Operations, 10Ops-Access-Requests: Requesting access to analytics-privatedata-users for Ed Sanders - https://phabricator.wikimedia.org/T184206#3876092 (10Ottomata) @Esanders, what data are you trying to access? `analytics-privatedata-users` does not get you access to stat1006 or the MySQL EventLogging datab... [20:00:04] no_justification: I, the Bot under the Fountain, allow thee, The Deployer, to do MediaWiki train deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180104T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:00:07] (03CR) 10Paladox: [C: 031] "Need to build the hooks plugin for 2.14." [software/gerrit] - 10https://gerrit.wikimedia.org/r/395820 (https://phabricator.wikimedia.org/T156120) (owner: 10Chad) [20:03:52] !log preparing to deploy the train (filling in for no_justification) [20:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:23] RoanKattouw: care to take a look at https://phabricator.wikimedia.org/T184123 ? `git blame` says the change was your handiwork. [20:05:51] I'd attempt a patch but I'm actually not sure what 'text' is supposed to be [20:06:38] !log There are still open blockers for wmf.15 - see T180748 .. attempting to resolve them to unblock the train. [20:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:49] T180748: 1.31.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T180748 [20:11:20] 10Operations, 10Cloud-VPS, 10cloud-services-team: wikidumpparse is using 1.2TB of 5T available NFS misc storage - https://phabricator.wikimedia.org/T183970#3876440 (10Dfko) Hi, I am looking around for the offending files to delete them, but it has been a long while since I worked on any of this and I don't r... [20:19:43] 10Operations, 10Cloud-VPS, 10cloud-services-team: templatetiger is using 827G of 8T available tools nfs storage - https://phabricator.wikimedia.org/T183954#3876450 (10Peachey88) a:05madhuvishy>03Kolossos [20:20:04] 10Operations, 10Ops-Access-Requests: Requesting access to analytics-privatedata-users for Ed Sanders - https://phabricator.wikimedia.org/T184206#3876092 (10Jdforrester-WMF) >>! In T184206#3876383, @Ottomata wrote: > @Esanders, what data are you trying to access? `analytics-privatedata-users` does not get you... 
[20:20:07] 10Operations, 10Ops-Access-Requests: Requesting access to analytics-privatedata-users for Ed Sanders - https://phabricator.wikimedia.org/T184206#3876453 (10Esanders) >>! In T184206#3876366, @RobH wrote: > Please sign the L3, thanks! Done. [20:23:18] 10Operations, 10Ops-Access-Requests: Requesting access to analytics-privatedata-users for Ed Sanders - https://phabricator.wikimedia.org/T184206#3876466 (10Ottomata) `researchers` and `analytics-privatedata-users` should be whatchu need. :) [20:25:13] twentyafterfour: Argh. It's supposed to be $sectionNAme [20:25:17] Fix coming [20:28:36] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/402109/1/includes/parser/Parser.php [20:37:13] (03CR) 10VolkerE: [C: 04-1] "Those are largely unoptimized. Please see https://www.mediawiki.org/wiki/Manual:Coding_conventions/SVG for in-depth optimization guideline" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401523 (https://phabricator.wikimedia.org/T178942) (owner: 10Urbanecm) [20:40:28] I never filed a task for that? cc twentyafterfour RoanKattouw? [20:40:39] I know I complained on IRC....like before Xmas [20:40:40] For what? [20:40:47] The undefined variable text thing? [20:42:13] Yeah [20:42:26] Not that I know of [20:45:11] Yeah I didn't file a task [20:45:15] I just bitched on IRC [20:45:19] (that usually works heh) [20:47:24] cherry-picking to wmf.15 [20:55:40] (03PS1) 10Rush: openstack: these servers should be an HA pair [puppet] - 10https://gerrit.wikimedia.org/r/402115 (https://phabricator.wikimedia.org/T167559) [20:56:06] (03CR) 10jerkins-bot: [V: 04-1] openstack: these servers should be an HA pair [puppet] - 10https://gerrit.wikimedia.org/r/402115 (https://phabricator.wikimedia.org/T167559) (owner: 10Rush) [20:57:38] chasemp: odd error, eh [20:57:58] The hostname 'labtestneutron200[1-2].codfw.wmnet' contains illegal characters (only letters, digits, '_', '-', and '.' 
are allowed) but we have [] all the time [20:58:34] (03PS2) 10Rush: openstack: these servers should be an HA pair [puppet] - 10https://gerrit.wikimedia.org/r/402115 (https://phabricator.wikimedia.org/T167559) [20:58:46] mutante: yeah not sure but trying a patch now [20:58:51] oh, i see it [20:59:05] needs to start with node /^ [20:59:17] yeppers [21:00:54] 10Operations, 10Developer-Relations, 10cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org (pilot instance) - https://phabricator.wikimedia.org/T180854#3876522 (10Qgil) Can you give me admin access, please? [21:00:57] (03CR) 10Rush: [C: 032] openstack: these servers should be an HA pair [puppet] - 10https://gerrit.wikimedia.org/r/402115 (https://phabricator.wikimedia.org/T167559) (owner: 10Rush) [21:10:53] (03CR) 10Aaron Schulz: "Sorry, I merged shortly before a break and didn't revert." [puppet] - 10https://gerrit.wikimedia.org/r/392221 (owner: 10Aaron Schulz) [21:14:27] 10Operations, 10Developer-Relations, 10cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org (pilot instance) - https://phabricator.wikimedia.org/T180854#3876561 (10Tgr) Done. 
[21:21:34] (03PS18) 10Aaron Schulz: [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - 10https://gerrit.wikimedia.org/r/392221 [21:24:14] (03PS1) 10BryanDavis: wmcs: maintain-meta_p missing python-requests [puppet] - 10https://gerrit.wikimedia.org/r/402117 [21:25:32] !log reboot multatuli for kernel update [21:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:10] (03PS1) 10Dzahn: httpd: testing new module with planet (test only) [puppet] - 10https://gerrit.wikimedia.org/r/402118 [21:32:57] (03CR) 10jerkins-bot: [V: 04-1] httpd: testing new module with planet (test only) [puppet] - 10https://gerrit.wikimedia.org/r/402118 (owner: 10Dzahn) [21:33:48] !log deploying patches to unblock the train [21:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:38] (03CR) 10BryanDavis: "Seems to work ok: https://puppet-compiler.wmflabs.org/compiler02/9579/labsdb1009.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/402117 (owner: 10BryanDavis) [21:35:03] !log twentyafterfour@tin Synchronized php-1.31.0-wmf.15/includes/parser/Parser.php: Deploy 601cf9d183b0e5a97d264048efaab71a4a925500 (duration: 01m 03s) [21:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:55] (03CR) 10Madhuvishy: [C: 032] wmcs: maintain-meta_p missing python-requests [puppet] - 10https://gerrit.wikimedia.org/r/402117 (owner: 10BryanDavis) [21:37:47] !log uploaded linux-4.9.65-3+deb9u1~bpo8+2 for jessie-wikimedia to apt.wikimedia.org (provides KPTI backport) [21:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:13] (03PS1) 10BryanDavis: pcc: Python3 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/402119 [21:44:35] 10Operations, 10Wikimedia-Mailing-lists: Emails send by subscribers don't arrive on the mailing list Moderators-nl - https://phabricator.wikimedia.org/T181906#3876587 (10herron) Just now I sent 3 testing messages with 
different subjects and 3 replies (one reply to each subject). The problem is happening for m... [21:46:49] (03PS1) 10Muehlenhoff: Bump meta package for new ABI in 4.9 [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/402121 [21:49:50] (03CR) 10Muehlenhoff: [C: 032] Bump meta package for new ABI in 4.9 [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/402121 (owner: 10Muehlenhoff) [21:53:20] !log twentyafterfour@tin Synchronized php-1.31.0-wmf.15/extensions/TitleBlacklist/TitleBlacklistPreAuthenticationProvider.php: Deploy 332fab0d737b5a524abbed7264d64890dd3ce6dc to stop logspam and unblock the train (duration: 01m 02s) [21:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:21] (03PS1) 1020after4: all wikis to 1.31.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402123 [21:59:23] (03CR) 1020after4: [C: 032] all wikis to 1.31.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402123 (owner: 1020after4) [22:00:35] !log No blockers remain for T180748, proceeding to deploy wmf.15 to all wikis [22:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:45] T180748: 1.31.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T180748 [22:01:14] (03CR) 10Krinkle: [C: 031] mediawiki: Remove unused python-pygments package [puppet] - 10https://gerrit.wikimedia.org/r/400458 (https://phabricator.wikimedia.org/T182851) (owner: 10Legoktm) [22:02:07] (03Merged) 10jenkins-bot: all wikis to 1.31.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402123 (owner: 1020after4) [22:02:21] (03CR) 10jenkins-bot: all wikis to 1.31.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402123 (owner: 1020after4) [22:03:12] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: all wikis to 1.31.0-wmf.15 [22:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:35] (03PS19) 10Aaron Schulz: [WIP] Add mcrouter 
module and mcrouter_wancache profile [puppet] - 10https://gerrit.wikimedia.org/r/392221 [22:04:40] (03PS2) 10Dzahn: httpd: testing new module with planet (test only) [puppet] - 10https://gerrit.wikimedia.org/r/402118 [22:05:41] RECOVERY - HP RAID on db2054 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [22:06:15] (03CR) 10jerkins-bot: [V: 04-1] httpd: testing new module with planet (test only) [puppet] - 10https://gerrit.wikimedia.org/r/402118 (owner: 10Dzahn) [22:09:47] !log uploaded linux-meta 1.16 for jessie-wikimedia to apt.wikimedia.org (which installs the new KPTI-enabled kernel with the new ABI) [22:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:32] (03PS3) 10Dzahn: httpd: testing new module with planet (test only) [puppet] - 10https://gerrit.wikimedia.org/r/402118 [22:25:46] (03CR) 10jerkins-bot: [V: 04-1] httpd: testing new module with planet (test only) [puppet] - 10https://gerrit.wikimedia.org/r/402118 (owner: 10Dzahn) [22:27:09] (03PS4) 10Dzahn: httpd: testing new module with planet (test only) [puppet] - 10https://gerrit.wikimedia.org/r/402118 [22:28:23] (03CR) 10jerkins-bot: [V: 04-1] httpd: testing new module with planet (test only) [puppet] - 10https://gerrit.wikimedia.org/r/402118 (owner: 10Dzahn) [22:29:03] how to remove the Ganglia line from https://wikitech.wikimedia.org/wiki/Template:Server without breaking the rest of the template? :) [22:29:14] i did it wrong, mismatching brackets [22:29:44] it's a template using another template [22:30:19] mutante https://wikitech.wikimedia.org/w/index.php?title=Template%3AServer&type=revision&diff=1779540&oldid=1779538 [22:30:20] https://wikitech.wikimedia.org/w/index.php?title=Template:Ganglia&action=edit [22:30:59] paladox: :) thanks [22:31:03] you're welcome :). 
[22:31:07] i didn't remove enough [22:31:16] heh [22:32:24] https://wikitech.wikimedia.org/w/index.php?title=Special%3AWhatLinksHere&target=Template%3AGanglia&namespace= uhmm [22:33:07] it still uses the template in another place [22:33:13] server template using ganglia template that is [22:33:32] but the link is just {{tl|Ganglia}}) [22:33:40] ;server_group: (optional) Name of organizational server group (not physical per se). Should match the "Source" group of the node in Ganglia (passed to {{tl|Ganglia}}) [22:33:40] ;server_nodename: (optional) "Node" hostname. Should match the "Node name" in Ganglia (passed to {{tl|Ganglia}}) [22:34:29] removes [22:34:42] https://wikitech.wikimedia.org/w/index.php?title=Special%3AWhatLinksHere&target=Template%3AGanglia&namespace= better [22:35:00] still lists Template::server itself :) [22:35:09] but no transclusion [22:36:19] 10Operations, 10Wikimedia-Mailing-lists: Emails send by subscribers don't arrive on the mailing list Moderators-nl - https://phabricator.wikimedia.org/T181906#3876673 (10Trijnstel) >>! In T181906#3876587, @herron wrote: > Just now I sent 3 testing messages with different subjects and 3 replies (one reply to ea... [22:36:38] == See also == [22:36:39] * {{tl|Server}} [22:37:02] and NOW i can delete the template:) [22:37:22] 10Operations, 10Wikimedia-Mailing-lists: Emails send by subscribers don't arrive on the mailing list Moderators-nl - https://phabricator.wikimedia.org/T181906#3876675 (10Natuur12) I received three emails from Herron, zero from Trijnstel. 
[22:38:34] (03PS5) 10Smalyshev: Add loading DCAT-AP data into dcatap namespace on WDQS [puppet] - 10https://gerrit.wikimedia.org/r/399954 (https://phabricator.wikimedia.org/T178978) [22:39:36] (03PS5) 10Dzahn: httpd: testing new module with planet (test only) [puppet] - 10https://gerrit.wikimedia.org/r/402118 [22:40:50] (03CR) 10jerkins-bot: [V: 04-1] httpd: testing new module with planet (test only) [puppet] - 10https://gerrit.wikimedia.org/r/402118 (owner: 10Dzahn) [22:41:28] 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3876678 (10Paladox) I think upstream have started rolling out the security update. [22:42:49] 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3875329 (10MoritzMuehlenhoff) @paladox: Most of WMCS runs trusty with either the 3.13 or 4.4 kernel and needs an update by Canonical (which is... 
[23:11:21] PROBLEM - HHVM rendering on mw2222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:11] RECOVERY - HHVM rendering on mw2222 is OK: HTTP OK: HTTP/1.1 200 OK - 73405 bytes in 0.328 second response time [23:12:25] (03PS1) 10Dzahn: network::constants: add fake CACHE_MISC for labs [puppet] - 10https://gerrit.wikimedia.org/r/402136 [23:12:59] (03CR) 10jerkins-bot: [V: 04-1] network::constants: add fake CACHE_MISC for labs [puppet] - 10https://gerrit.wikimedia.org/r/402136 (owner: 10Dzahn) [23:15:13] (03CR) 10Thcipriani: [C: 031] "Seems to work as long as you don't want a service called ^A" [deployment-charts] - 10https://gerrit.wikimedia.org/r/399256 (owner: 10Dduvall) [23:17:25] (03PS1) 10BryanDavis: wmcs: Add database drop support to maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/402137 (https://phabricator.wikimedia.org/T181925) [23:17:48] (03CR) 10jerkins-bot: [V: 04-1] wmcs: Add database drop support to maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/402137 (https://phabricator.wikimedia.org/T181925) (owner: 10BryanDavis) [23:19:31] (03PS2) 10BryanDavis: wmcs: Add database drop support to maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/402137 (https://phabricator.wikimedia.org/T181925) [23:19:55] (03CR) 10jerkins-bot: [V: 04-1] wmcs: Add database drop support to maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/402137 (https://phabricator.wikimedia.org/T181925) (owner: 10BryanDavis) [23:21:34] (03PS6) 10Dzahn: httpd: testing new module with planet (test only) [puppet] - 10https://gerrit.wikimedia.org/r/402118 [23:23:06] (03CR) 10jerkins-bot: [V: 04-1] httpd: testing new module with planet (test only) [puppet] - 10https://gerrit.wikimedia.org/r/402118 (owner: 10Dzahn) [23:30:02] !log rebooted releases1001 and 2001 (new kernel) [23:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:16] (03PS3) 10BryanDavis: wmcs: Add database drop support to maintain-views 
[puppet] - 10https://gerrit.wikimedia.org/r/402137 (https://phabricator.wikimedia.org/T181925) [23:34:56] * bd808 wishes he could get the ops/puppet tests to execute locally [23:35:21] no puppet compiler? [23:35:27] dammit, I meant to not look at irc [23:36:15] apergos: no its something funky about bundler on my laptop I think. Weird and non-sensical ruby errors [23:36:22] pcc works for me [23:36:26] just not the linters [23:37:30] ah ha [23:38:28] and now I really am going to stop looking. good night folks [23:51:05] (03CR) 10Smalyshev: "I think this is now ready for merge" [puppet] - 10https://gerrit.wikimedia.org/r/399954 (https://phabricator.wikimedia.org/T178978) (owner: 10Smalyshev)