[00:08:11] <icinga-wm>	 PROBLEM - puppet last run on db2028 is CRITICAL: CRITICAL: puppet fail
[00:34:42] <icinga-wm>	 RECOVERY - puppet last run on db2028 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[01:09:05] <icinga-wm>	 PROBLEM - puppet last run on mw2013 is CRITICAL: CRITICAL: puppet fail
[01:36:06] <icinga-wm>	 RECOVERY - puppet last run on mw2013 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[01:43:35] <icinga-wm>	 PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: puppet fail
[02:00:10] <logmsgbot>	 !log l10nupdate@tin LocalisationUpdate failed: git pull of core failed
[02:00:17] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:10:36] <icinga-wm>	 RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[03:35:46] <icinga-wm>	 PROBLEM - puppet last run on restbase-test2001 is CRITICAL: CRITICAL: puppet fail
[04:02:54] <icinga-wm>	 RECOVERY - puppet last run on restbase-test2001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:55:24] <icinga-wm>	 RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[06:56:26] <icinga-wm>	 RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:56:34] <icinga-wm>	 RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:56:45] <icinga-wm>	 RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:56:56] <icinga-wm>	 RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:14] <icinga-wm>	 RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:34] <icinga-wm>	 RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:57:36] <icinga-wm>	 RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:57:45] <icinga-wm>	 RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:46] <icinga-wm>	 RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:05] <icinga-wm>	 RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:25] <icinga-wm>	 RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:45] <icinga-wm>	 PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: puppet fail
[07:18:34] <icinga-wm>	 PROBLEM - puppet last run on iodine is CRITICAL: CRITICAL: Puppet has 1 failures
[07:27:45] <icinga-wm>	 RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[07:37:50] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db1019 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 405
[07:41:40] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db1019 is OK: OK slave_sql_lag Seconds_Behind_Master: 15
[07:42:34] <wikibugs>	 6operations, 5Patch-For-Review: Investigate redis connections errors on rdb100[13] - https://phabricator.wikimedia.org/T119739#1835717 (10Joe) 5Open>3Resolved
[07:43:45] <icinga-wm>	 RECOVERY - puppet last run on iodine is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[10:00:14] <icinga-wm>	 PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 627
[10:05:14] <icinga-wm>	 RECOVERY - check_mysql on db1008 is OK: Uptime: 1266042 Threads: 2 Questions: 21835105 Slow queries: 8461 Opens: 15016 Flush tables: 2 Open tables: 64 Queries per second avg: 17.246 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[10:59:35] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0]
[11:04:55] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [150.0]
[11:08:45] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0]
[11:13:25] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[11:16:34] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[11:16:44] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[12:12:35] <icinga-wm>	 PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:38:05] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [150.0]
[12:38:44] <icinga-wm>	 PROBLEM - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [500.0]
[12:39:35] <icinga-wm>	 RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[12:39:45] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [150.0]
[12:51:24] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[12:53:25] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[12:54:04] <icinga-wm>	 RECOVERY - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:34:50] <wikibugs>	 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, 10Traffic: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1835921 (10zhuyifei1999) & https://commons.wikimedia.org/wiki/Commons:Village_pump#Problems_in_new_file_version
[14:11:55] <icinga-wm>	 PROBLEM - puppet last run on mw2190 is CRITICAL: CRITICAL: puppet fail
[14:38:56] <icinga-wm>	 RECOVERY - puppet last run on mw2190 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[16:02:30] <grrrit-wm>	 (03PS1) 10BBlack: Revert "purging: do not VCL-filter on domain regex" [puppet] - 10https://gerrit.wikimedia.org/r/255807 (https://phabricator.wikimedia.org/T119038) 
[16:02:32] <grrrit-wm>	 (03PS1) 10BBlack: Revert "upload purging: do not listen on text/mobile addr" [puppet] - 10https://gerrit.wikimedia.org/r/255808 (https://phabricator.wikimedia.org/T119038) 
[16:03:48] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] Revert "purging: do not VCL-filter on domain regex" [puppet] - 10https://gerrit.wikimedia.org/r/255807 (https://phabricator.wikimedia.org/T119038) (owner: 10BBlack)
[16:04:14] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] Revert "upload purging: do not listen on text/mobile addr" [puppet] - 10https://gerrit.wikimedia.org/r/255808 (https://phabricator.wikimedia.org/T119038) (owner: 10BBlack)
[16:09:24] <icinga-wm>	 PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: puppet fail
[16:13:06] <icinga-wm>	 RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:16:06] <wikibugs>	 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1836058 (10BBlack) Above I've done some partial reversion of the multiple commits involved in the multicast split work as an e...
[16:18:16] <icinga-wm>	 PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: puppet fail
[16:20:24] <icinga-wm>	 PROBLEM - HHVM rendering on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:20:54] <icinga-wm>	 PROBLEM - Apache HTTP on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:21:44] <icinga-wm>	 PROBLEM - HHVM processes on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:21:45] <icinga-wm>	 PROBLEM - RAID on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:21:55] <icinga-wm>	 PROBLEM - puppet last run on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:21:55] <icinga-wm>	 PROBLEM - DPKG on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:22:05] <icinga-wm>	 PROBLEM - configured eth on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:22:06] <icinga-wm>	 RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[16:22:15] <icinga-wm>	 PROBLEM - nutcracker port on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:23:04] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:23:14] <icinga-wm>	 PROBLEM - SSH on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:23:25] <icinga-wm>	 PROBLEM - nutcracker process on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:23:25] <icinga-wm>	 PROBLEM - salt-minion processes on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:26:24] <icinga-wm>	 PROBLEM - dhclient process on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:26:35] <icinga-wm>	 PROBLEM - Disk space on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:37:35] <icinga-wm>	 RECOVERY - configured eth on mw1129 is OK: OK - interfaces up
[16:37:44] <icinga-wm>	 RECOVERY - nutcracker port on mw1129 is OK: TCP OK - 0.000 second response time on port 11212
[16:37:55] <icinga-wm>	 RECOVERY - dhclient process on mw1129 is OK: PROCS OK: 0 processes with command name dhclient
[16:38:05] <icinga-wm>	 RECOVERY - Disk space on mw1129 is OK: DISK OK
[16:38:25] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1129 is OK: OK: nf_conntrack is 0 % full
[16:38:34] <icinga-wm>	 RECOVERY - SSH on mw1129 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[16:38:46] <icinga-wm>	 RECOVERY - nutcracker process on mw1129 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[16:38:46] <icinga-wm>	 RECOVERY - salt-minion processes on mw1129 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[16:38:55] <icinga-wm>	 RECOVERY - HHVM processes on mw1129 is OK: PROCS OK: 6 processes with command name hhvm
[16:39:05] <icinga-wm>	 RECOVERY - RAID on mw1129 is OK: OK: no RAID installed
[16:39:15] <icinga-wm>	 RECOVERY - puppet last run on mw1129 is OK: OK: Puppet is currently enabled, last run 50 minutes ago with 0 failures
[16:39:16] <icinga-wm>	 RECOVERY - DPKG on mw1129 is OK: All packages OK
[16:39:35] <icinga-wm>	 RECOVERY - HHVM rendering on mw1129 is OK: HTTP OK: HTTP/1.1 200 OK - 65926 bytes in 0.618 second response time
[16:40:04] <icinga-wm>	 RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.074 second response time
[18:22:16] <grrrit-wm>	 (03PS1) 10Mdann52: Enable new user groups on gu.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) 
[18:22:37] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Enable new user groups on gu.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: 10Mdann52)
[19:07:54] <grrrit-wm>	 (03PS2) 10Mdann52: Enable new user groups on gu.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) 
[19:08:12] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Enable new user groups on gu.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: 10Mdann52)
[19:08:39] <grrrit-wm>	 (03CR) 10Mdann52: "Sometimes, I hate commas....." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: 10Mdann52)
[19:12:32] <grrrit-wm>	 (03PS3) 10Mdann52: Enable new user groups on gu.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) 
[20:32:07] <grrrit-wm>	 (03CR) 10Reedy: "I think it can be swatted. A seperate deployment window wouldn't really help. I just think it needs some monitoring post deployment (over " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) (owner: 10Brian Wolff)
[20:33:44] <grrrit-wm>	 (03PS5) 10Ori.livneh: Increase $wgCopyUploadTimeout to 90 seconds (from default 25) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) (owner: 10Brian Wolff)
[20:33:50] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032] Increase $wgCopyUploadTimeout to 90 seconds (from default 25) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) (owner: 10Brian Wolff)
[20:34:16] <grrrit-wm>	 (03Merged) 10jenkins-bot: Increase $wgCopyUploadTimeout to 90 seconds (from default 25) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) (owner: 10Brian Wolff)
[20:35:00] <grrrit-wm>	 (03PS1) 10Reedy: Update Image Area to 100MP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255822 
[20:35:20] <logmsgbot>	 !log ori@tin Synchronized wmf-config/InitialiseSettings.php: Ie33ae3b6a: Increase $wgCopyUploadTimeout to 90 seconds (from default 25) (T118887) (duration: 00m 27s)
[20:35:24] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:35:45] <icinga-wm>	 PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:38:21] <grrrit-wm>	 (03CR) 10Reedy: "Various url-downloader timeouts are at https://github.com/wikimedia/operations-puppet/blob/812f280d16acfe3083259e8dfa7ce12ebf71da87/templa" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) (owner: 10Brian Wolff)
[20:41:22] <Reedy>	 gj ori :D
[20:45:28] <wikibugs>	 6operations, 10Wikimedia-General-or-Unknown: localisationupdate broken on wmf wikis - https://phabricator.wikimedia.org/T119746#1836241 (10Reedy)
[20:49:19] <logmsgbot>	 !log krenair@tin Synchronized php-1.27.0-wmf.7/cache/l10n: l10nupdate for 1.27.0-wmf.7 (duration: 07m 11s)
[20:49:19] <logmsgbot>	 !log l10nupdate@tin LocalisationUpdate failed: Failed to sync-dir 'php-1.27.0-wmf.7/cache/l10n'
[20:49:23] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:49:27] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:51:48] <wikibugs>	 6operations, 10Wikimedia-General-or-Unknown: localisationupdate broken on wmf wikis - https://phabricator.wikimedia.org/T119746#1836244 (10Reedy) ``` fatal: You don't exist. Go away! ```  That's apparently a ssh related error
[20:54:52] <Reedy>	 Krenair: Found it
[20:55:18] <wikibugs>	 6operations, 10Wikimedia-General-or-Unknown: localisationupdate broken on wmf wikis - https://phabricator.wikimedia.org/T119746#1836253 (10Reedy) ``` l10nupdate@tin:~$ git var GIT_AUTHOR_IDENT  *** Please tell me who you are.  Run    git config --global user.email "you@example.com"   git config --global user.n...
[20:55:52] <Reedy>	 I can fix it, but it's not puppetised
[20:55:57] <Reedy>	 And I don't know why it broke after the uid change
[20:57:06] <bawolff>	 thanks ori 
[21:02:14] <wikibugs>	 6operations, 10Wikimedia-General-or-Unknown: localisationupdate broken on wmf wikis - https://phabricator.wikimedia.org/T119746#1836257 (10Reedy) So this will fix it, but I don't know why it got broken. Was it by mutante changing the uid? How did it ever work before? Can we puppetise this somehow?  ``` l10nupd...
[21:02:35] <icinga-wm>	 RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:07:21] <Krenair>	 Reedy, it's not the only issue though
[21:07:33] <Reedy>	 No, but the other issues are scap related
[21:07:39] <Krenair>	 I'm not sure
[21:08:09] <Krenair>	 20:42:46 ['/srv/deployment/scap/scap/bin/sync-master', 'tin.eqiad.wmnet'] on mira.codfw.wmnet returned [70]: 20:42:46 Copying to mira.codfw.wmnet from tin.eqiad.wmnet
[21:08:09] <Krenair>	 20:42:46 Started rsync master
[21:08:09] <Krenair>	 sudo: a password is required
[21:08:09] <Krenair>	 20:42:46 Finished rsync master (duration: 00m 00s)
[21:08:09] <Krenair>	 20:42:46 Unhandled error:
[21:08:12] <Krenair>	 (stuff)
[21:08:19] <Krenair>	 20:42:46 sync-master failed: <CalledProcessError> Command '['sudo', '-n', '--', '/usr/local/bin/scap-master-sync', 'tin.eqiad.wmnet']' returned non-zero exit status 1
[21:08:32] <Krenair>	 "sudo: a password is required"?
[21:16:12] <Reedy>	 Doesn't the master co-sync sync as root to get all the permissions etc?
[21:16:31] <Reedy>	 I think you get similar if you run sync-file etc yourself... Or did at one point
[21:32:46] <Luke081515>	 something in gerrit: Maybe it could be useful to add User Mdann52 to whitelist, so that jenkins-bot will make all tests. The changes looked good, so I guess this is not a problem, isn't it? Here an example: https://gerrit.wikimedia.org/r/#/c/255810/
[21:34:53] <grrrit-wm>	 (03CR) 10Luke081515: [C: 04-1] "In my opinion we should ask, if they think, that sysop can remove these rights too. The point is: It is not useful, that they need to cont" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: 10Mdann52)
[21:36:33] <Reedy>	 He's still relatively new
[21:36:43] <Reedy>	 Whitelisting gives quite a lot of potential
[21:39:19] <yuvipanda>	 Krenair: uh, I fixed grrrit-wm yesterda
[21:39:20] <yuvipanda>	 y
[21:39:24] <yuvipanda>	 was a kubernetes DNS issue
[21:43:33] <Luke081515>	 Reedy: hm, ok. I don't know the full potential, so...
[21:43:47] <Reedy>	 It was wider than it was
[21:43:54] <Reedy>	 But we narrowed the scope because there is the potential for abuse
[21:44:19] <grrrit-wm>	 (03PS4) 10Mdann52: Enable new user groups on gu.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) 
[21:48:22] <grrrit-wm>	 (03CR) 10Luke081515: [C: 04-1] "Sorry, if this was missunderstanding, but that was not a problem of your patch, this is a problem at the task at phabricator, this point i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: 10Mdann52)
[21:49:15] <grrrit-wm>	 (03CR) 10Mdann52: "Personally, this seems to be common sense - in any case, this isn't going to be merged until January, so plenty of time to sort it out" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: 10Mdann52)
[22:21:35] <icinga-wm>	 PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[22:27:55] <wikibugs>	 6operations, 10Wikimedia-General-or-Unknown: localisationupdate broken on wmf wikis - https://phabricator.wikimedia.org/T119746#1836299 (10bd808) >>! In T119746#1836257, @Reedy wrote: > Can we puppetise this somehow?  We would just need to provision a ~l10nupdate/.gitconfig file I think. No idea why this is su...
[22:28:33] <Reedy>	 bd808: The only change I know that has happened was mutante|away changning the uid for l10nupdate
[22:28:50] <Reedy>	 But there wasn't a .gitconfig there before
[22:29:51] <bd808>	 *nod* I wou;dn't expect we suddenly have a new git version on tin or something
[22:30:26] <wikibugs>	 6operations, 10Wikimedia-General-or-Unknown: localisationupdate broken on wmf wikis - https://phabricator.wikimedia.org/T119746#1836300 (10Reedy) >>! In T119746#1836299, @bd808 wrote: >>>! In T119746#1836257, @Reedy wrote: >> Can we puppetise this somehow? >  > We would just need to provision a ~l10nupdate/.gi...
[22:32:05] * Reedy shrugs
[22:32:36] <yuvipanda>	 bd808: thanks for the replag tool!
[22:32:41] <bd808>	 it's the curse of scap. everytime we try to make it better we make it strangely unstable for a few weeks
[22:32:45] <bd808>	 yuvipanda: yw
[22:33:03] <Reedy>	 computers suck
[22:33:19] <yuvipanda>	 let's rewrite it in nodejs...
[22:33:39] * bd808 slaps yuvipanda with a large COBOL module
[22:33:40] <Reedy>	 Now yuvipanda, no need to be like that
[22:34:25] <yuvipanda>	 fine, fine :)
[22:34:28] <yuvipanda>	 is tin trusty yet?
[22:34:45] <Reedy>	 I don't think so
[22:34:57] <bd808>	 no, but I think we are pretty close to being able to switch to mira and reimage tin
[22:35:05] <Reedy>	 DISTRIB_DESCRIPTION="Ubuntu 12.04.5 LTS"
[22:36:55] <Krenair>	 couldn't we switch to mira tomorrow?
[22:37:33] <Reedy>	 Presumably we could, just we'd swap the tin/mira issues around
[22:38:00] <yuvipanda>	 is mira trusty?
[22:38:34] <bd808>	 well... I don't know about l10nupdate, trebuchet and we still don't have the uid for l10update pinned in puppet so that would have to be manually fixed on tin after reimaging
[22:38:58] <Reedy>	 yuvipanda: yus
[22:38:59] <bd808>	 I think j.oe was going to tackle trebuchet next
[22:39:29] <bd808>	 by introducing a service name for the deploy server and fixing all the git remotes to point at it
[22:39:55] <Krenair>	 bd808, l10nupdate is broken at the moment due to scap issues
[22:40:19] <Krenair>	 to enable it on mira you just change a hiera entry anyway
[22:40:25] <bd808>	 besides the git thing that Reedy was poking at?
[22:40:35] <Krenair>	 I have no idea what trebuchet is even used for other than scap itself
[22:40:43] <Reedy>	 yeah, the return value of sync-dir is used too sensatively
[22:41:02] <Krenair>	 anything of value?
[22:41:10] <bd808>	 lots of things (ieg, scholarships, parsoid, ocg, kibana, ...)
[22:41:29] <bd808>	 graphoid
[22:41:43] <bd808>	 pretty much everything except MW and restbase
[22:41:46] <Reedy>	 https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/files/l10nupdate-1#L108-L113
[22:41:54] <Reedy>	 if [[ $? -ne 0 ]]; then
[22:42:19] <bd808>	 so sync-dir is failing as l10nupdate?
[22:42:48] <Reedy>	 I think it's "failing"
[22:43:12] <Reedy>	 At a glance, it seems there's some error number returned because of the master co-sync
[22:43:23] <Reedy>	 But the sync-dir everywhere else is fine
[22:43:25] <Reedy>	 So it just doesn't continue
[22:43:40] <bd808>	 hmmm...
[22:43:43] <Reedy>	 we could just disable that error handling till the master cosync is fixed
[22:43:54] <Reedy>	 When I ran it manually, IIRC it did sync out everywhere
[22:44:07] <Reedy>	 It then ended
[22:44:13] <Reedy>	 Then run dsh -g mediawiki-installation -M -F 40 -- "sudo -u mwdeploy $SCAPDIR/scap-rebuild-cdbs" manually etc
[22:46:44] <icinga-wm>	 RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[22:48:13] <logmsgbot>	 !log bd808@tin Synchronized php-1.27.0-wmf.5/cache/l10n: bd808 testing l10nupdate sync-dir using stale branch (duration: 01m 29s)
[22:48:17] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:48:19] <bd808>	 ah ha
[22:48:38] <bd808>	 missing sudoer stuff -- 22:47:56 sync-master failed: <CalledProcessError> Command '['sudo', '-n', '--', '/usr/local/bin/scap-master-sync', 'tin.eqiad.wmnet']' returned non-zero exit status 1
[22:48:58] <bd808>	 we made that wikidev only
[22:49:09] <Reedy>	 i am disappoint
[22:49:15] <bd808>	 do we have an open bug for htis?
[22:49:23] <Reedy>	 Not specifically
[22:49:32] <bd808>	 now I am disapoint
[22:49:34] <Reedy>	 Just the "localisationupdate broken on wmf wikis" as above
[22:50:14] <Reedy>	 Cause, technically, it's still broken
[22:50:26] <Reedy>	 Easy enough fix then, just needs a root?
[22:51:03] <yuvipanda>	 anything I can do?
[22:51:16] <Reedy>	 Mebbe
[22:51:25] <Reedy>	 Dunno if bd808 is going to make a patch
[22:51:27] * Reedy has a look
[22:51:31] <yuvipanda>	 ok!
[22:51:43] <yuvipanda>	 I'll be around on and off for the next hour or something at least, so let me know :)
[22:51:49] <wikibugs>	 6operations, 10Wikimedia-General-or-Unknown: localisationupdate broken on wmf wikis - https://phabricator.wikimedia.org/T119746#1836325 (10bd808) ``` tin:/srv/mediawiki-staging  (git master $) l10nupdate$ /usr/local/bin/sudo-withagent l10nupdate /srv/deployment/scap/scap/bin/sync-dir --no-shared-authsock -D ss...
[22:51:51] * yuvipanda makes foooooood
[22:53:00] <Reedy>	 https://github.com/wikimedia/operations-puppet/blob/5d74cf04a44fc1a3e35f5380371116005a93f614/modules/scap/manifests/master.pp#L52
[22:53:11] <bd808>	 yeah
[22:53:13] <Reedy>	 bd808: ^^  I guess it's that line. Shall I make a patch?
[22:53:58] <bd808>	 sure. dig around and figure out if we can grant to 2 different users with no common groups
[22:54:17] <Reedy>	  user       => [ 'l10nupdate', 'mwdeploy', ],
[22:54:29] <bd808>	 or we need to finally do something to make l10nupdate not be special (not use its own ssh key)
[22:54:29] <Reedy>	 Will that not work?
[22:55:13] <bd808>	 dunno. I'll look at the puppet resource
[22:56:28] <bd808>	 I don't think that will work. See https://github.com/wikimedia/operations-puppet/blob/5d74cf04a44fc1a3e35f5380371116005a93f614/modules/sudo/templates/sudoers.erb
[22:57:00] <Reedy>	 So we duplicate the block/and or parameterise the variable?
[22:58:15] <bd808>	 the "right" way to do a sudoers rule to a list of arbitrary users is with a user_alias
[22:58:34] <bd808>	 you make an alias with the list of users and then grant the rule to that list
[22:59:01] <Reedy>	 hmm
[22:59:33] <bd808>	 which could be done in that erb template I suppose
[23:00:28] <bd808>	 l10nupdate is suce a special snowflake. its very annoying
[23:01:19] <Reedy>	 Mmmm
[23:01:59] <p858snake>	 bd808: bribe Reedy with enough stroopwaffles and he might rewrite it
[23:03:16] <bd808>	 the sudoers file would need "User_Alias scap_as_root = mwdeploy, l10nupdate" and the grant the rule to "scap_as_root" (or any other arbitrary name)
[23:03:44] <bd808>	 p858snake: if it was that easy I would have done it 2 years ago ;)
[23:04:14] <Krenair>	 why is l10nupdate separate like this anyway?
[23:04:15] <Reedy>	 <%- if @user_alias then -%> <%= @user_alias %><%- end -%>
[23:05:56] <Reedy>	 https://wikitech.wikimedia.org/wiki/LocalisationUpdate is out of date
[23:07:24] * Reedy deletes a block
[23:08:09] <bd808>	 Krenair: the user is separate for privilege separation (can't mess with MW code, only l10n files)
[23:08:28] <bd808>	 it also predates the shared ssh user for syncing
[23:08:44] <bd808>	 which is really maybe the aprt we should kill
[23:08:47] <bd808>	 *part
[23:09:09] <bd808>	 Reedy and I left a comment about that when we changed it to use sync-dir
[23:09:40] <bd808>	 it used to use dsh and duplicate a bunch of rsync logig
[23:10:10] * bd808 continues his tradition of shitty typing 
[23:10:32] <yuvipanda>	 like, javascript / php? :)
[23:10:51] * bd808 sends yuvipanda to reform school
[23:11:06] <bd808>	 this new snarkypanda, tsk
[23:12:17] <bd808>	 Reedy: I'm wondering if the better fix wouldn't be to allow l10nupdate to run sync-dir as mwdeploy
[23:12:30] <yuvipanda>	 I think I've been staring at perl6, go and rust too much...
[23:12:47] * bd808 tries to think of the bad things that l10nupdate could do with sync-dir
[23:12:56] * yuvipanda snobs harder
[23:13:22] * bd808 pats yuvipanda and tells him that his desire for strong typing will come and go as he grows up
[23:14:18] * yuvipanda agrees, thinks back to his Java (and C# and VB days)
[23:14:25] <_joe_>	 bd808: there is loose typing, and then there is "hi" == 0 and ===
[23:14:30] <_joe_>	 or well, javascript
[23:14:31] <bd808>	 when I came back to php+python from 7 years in java duck typing felt scary and wrong
[23:14:53] * yuvipanda is happy with python's typing
[23:14:57] <_joe_>	 I mean python has a decent type system, if you don't count utf-8 strings
[23:15:05] <_joe_>	 which they got wrong twice
[23:15:12] <bd808>	 everybody got utf-8 wrong
[23:15:19] <_joe_>	 ruby's typing is nice
[23:15:22] <yuvipanda>	 ok, we should let bd808 and Reedy fix things instead of sidetracking into language waaaaaaaaaaars
[23:15:28] <_joe_>	 ahahah
[23:15:29] <_joe_>	 yeah
[23:15:33] <_joe_>	 what's broken?
[23:15:41] <yuvipanda>	 the world!
[23:15:57] <_joe_>	 oh I thought something more nerdously interesting
[23:16:01] <bd808>	 the nightly l10nupdate sync is failing because of things we did to scap
[23:16:45] <bd808>	 l10nupdate runs sync-dir as itself rather than as the shared mwdeploy user. the l10nupdate user is missing rights for the master-master sync
[23:17:17] <yuvipanda>	 valhallasw`cloud: so PAWS now has resource limits - guaranteed 128M, upto 1G per user. also some CPU quotas
[23:17:19] <yuvipanda>	 now to test...
[23:17:20] <bd808>	 so we need to either grant l10nupdate those rights or undo the decision to keep it syncing as a distinct user
[23:18:42] <bd808>	 Reedy: do you remember if I was just being lazy in not making a new scap subcommand to do the l10n files sync and cdb rebuild?
[23:18:52] <_joe_>	 bd808: it's past midnight and I opened a new bottle of whisky tonight. I'm not inebriated, but too sleepy to help :/
[23:19:16] <bd808>	 _joe_: its the middle of a 4 day weekend for me. I'm not even here ;)
[23:19:24] <_joe_>	 ahah ok
[23:19:25] <Reedy>	 bd808: I think you were just busy, and I was improving the situation by minimal amounts of work
[23:20:10] <_joe_>	 bd808: what is the reason for that script to run as a separate user?
[23:20:10] <bd808>	 Reedy: sounds right. I think you got to all of this after I was supposed to be doing something else instead of scap
[23:21:33] <bd808>	 _joe_: originally privilege separation. It was doing automated updates of the MW hsots before we introduced the shared ssh user/key
[23:22:12] <bd808>	 it is still a good idea today that the nightly script can't mess with the php code that the wikis run
[23:23:20] <bd808>	 Reedy: if we made a new scap command for the sync + cdb build then we could grant l10nupdate the right to sudo that command as mwdeploy
[23:23:41] <Reedy>	 Mmm
[23:23:44] <bd808>	 that should mostly isolate the damage that l10nupdate could do
[23:23:47] <Reedy>	 There might be a task for that
[23:23:54] <bd808>	 and kill the key and the dsh usage
[23:25:55] <_joe_>	 bd808: I don't see why we can't run the sync as mwdeploy
[23:26:44] <Reedy>	 It should still be rewritten into scap though
[23:27:54] <bd808>	 _joe_: I think I agree. even if we let it sync anything from mediawiki-staging it can't do more than push code someone else has already staged there
[23:28:25] <bd808>	 but we should do it with a scap script that consolidates the two parts and lets us try and validate what is happening too
[23:28:45] <Reedy>	 I can't actually find a task for converting l10nupdate to scap
[23:28:48] <Reedy>	 I'm sure there was one
[23:29:05] <bd808>	 we probably closed it when sync-dir worked
[23:29:27] <Reedy>	 https://phabricator.wikimedia.org/T72443
[23:29:50] <bd808>	 *nod*
[23:29:50] <Reedy>	 Mmm
[23:29:56] <Reedy>	 Just open a new one do you think?
[23:30:17] <bd808>	 or jsut use T119746
[23:33:25] <wikibugs>	 6operations, 10Wikimedia-General-or-Unknown: localisationupdate broken on wmf wikis - https://phabricator.wikimedia.org/T119746#1836350 (10bd808) >>! In T119746#1836325, @bd808 wrote: > The curse of l10nupdate strikes again. When we granted the sudoer rights to run the master-master sync as root we only grante...