[00:08:11] PROBLEM - puppet last run on db2028 is CRITICAL: CRITICAL: puppet fail
[00:34:42] RECOVERY - puppet last run on db2028 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[01:09:05] PROBLEM - puppet last run on mw2013 is CRITICAL: CRITICAL: puppet fail
[01:36:06] RECOVERY - puppet last run on mw2013 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[01:43:35] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: puppet fail
[02:00:10] !log l10nupdate@tin LocalisationUpdate failed: git pull of core failed
[02:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:10:36] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[03:35:46] PROBLEM - puppet last run on restbase-test2001 is CRITICAL: CRITICAL: puppet fail
[04:02:54] RECOVERY - puppet last run on restbase-test2001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:55:24] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[06:56:26] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:56:34] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:56:45] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:56:56] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:14] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:34] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:57:36] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:57:45] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:46] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:05] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:25] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:45] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: puppet fail
[07:18:34] PROBLEM - puppet last run on iodine is CRITICAL: CRITICAL: Puppet has 1 failures
[07:27:45] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[07:37:50] PROBLEM - MariaDB Slave Lag: s4 on db1019 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 405
[07:41:40] RECOVERY - MariaDB Slave Lag: s4 on db1019 is OK: OK slave_sql_lag Seconds_Behind_Master: 15
[07:42:34] operations, Patch-For-Review: Investigate redis connections errors on rdb100[13] - https://phabricator.wikimedia.org/T119739#1835717 (Joe) Open>Resolved
[07:43:45] RECOVERY - puppet last run on iodine is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[10:00:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 627
[10:05:14] RECOVERY - check_mysql on db1008 is OK: Uptime: 1266042 Threads: 2 Questions: 21835105 Slow queries: 8461 Opens: 15016 Flush tables: 2 Open tables: 64 Queries per second avg: 17.246 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[10:59:35] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0]
[11:04:55] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [150.0]
[11:08:45] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0]
[11:13:25] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[11:16:34] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[11:16:44] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[12:12:35] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:38:05] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [150.0]
[12:38:44] PROBLEM - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [500.0]
[12:39:35] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[12:39:45] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [150.0]
[12:51:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[12:53:25] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[12:54:04] RECOVERY - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:34:50] operations, Commons, MediaWiki-File-management, Multimedia, Traffic: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1835921 (zhuyifei1999) & https://commons.wikimedia.org/wiki/Commons:Village_pump#Problems_in_new_file_version
[14:11:55] PROBLEM - puppet last run on mw2190 is CRITICAL: CRITICAL: puppet fail
[14:38:56] RECOVERY - puppet last run on mw2190 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[16:02:30] (PS1) BBlack: Revert "purging: do not VCL-filter on domain regex" [puppet] - https://gerrit.wikimedia.org/r/255807 (https://phabricator.wikimedia.org/T119038)
[16:02:32] (PS1) BBlack: Revert "upload purging: do not listen on text/mobile addr" [puppet] - https://gerrit.wikimedia.org/r/255808 (https://phabricator.wikimedia.org/T119038)
[16:03:48] (CR) BBlack: [C: 2] Revert "purging: do not VCL-filter on domain regex" [puppet] - https://gerrit.wikimedia.org/r/255807 (https://phabricator.wikimedia.org/T119038) (owner: BBlack)
[16:04:14] (CR) BBlack: [C: 2] Revert "upload purging: do not listen on text/mobile addr" [puppet] - https://gerrit.wikimedia.org/r/255808 (https://phabricator.wikimedia.org/T119038) (owner: BBlack)
[16:09:24] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: puppet fail
[16:13:06] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:16:06] operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1836058 (BBlack) Above I've done some partial reversion of the multiple commits involved in the multicast split work as an e...
[16:18:16] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: puppet fail
[16:20:24] PROBLEM - HHVM rendering on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:20:54] PROBLEM - Apache HTTP on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:21:44] PROBLEM - HHVM processes on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:21:45] PROBLEM - RAID on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:21:55] PROBLEM - puppet last run on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:21:55] PROBLEM - DPKG on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:22:05] PROBLEM - configured eth on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:22:06] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[16:22:15] PROBLEM - nutcracker port on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:23:04] PROBLEM - Check size of conntrack table on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:23:14] PROBLEM - SSH on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:23:25] PROBLEM - nutcracker process on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:23:25] PROBLEM - salt-minion processes on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:26:24] PROBLEM - dhclient process on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:26:35] PROBLEM - Disk space on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:37:35] RECOVERY - configured eth on mw1129 is OK: OK - interfaces up
[16:37:44] RECOVERY - nutcracker port on mw1129 is OK: TCP OK - 0.000 second response time on port 11212
[16:37:55] RECOVERY - dhclient process on mw1129 is OK: PROCS OK: 0 processes with command name dhclient
[16:38:05] RECOVERY - Disk space on mw1129 is OK: DISK OK
[16:38:25] RECOVERY - Check size of conntrack table on mw1129 is OK: OK: nf_conntrack is 0 % full
[16:38:34] RECOVERY - SSH on mw1129 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[16:38:46] RECOVERY - nutcracker process on mw1129 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[16:38:46] RECOVERY - salt-minion processes on mw1129 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[16:38:55] RECOVERY - HHVM processes on mw1129 is OK: PROCS OK: 6 processes with command name hhvm
[16:39:05] RECOVERY - RAID on mw1129 is OK: OK: no RAID installed
[16:39:15] RECOVERY - puppet last run on mw1129 is OK: OK: Puppet is currently enabled, last run 50 minutes ago with 0 failures
[16:39:16] RECOVERY - DPKG on mw1129 is OK: All packages OK
[16:39:35] RECOVERY - HHVM rendering on mw1129 is OK: HTTP OK: HTTP/1.1 200 OK - 65926 bytes in 0.618 second response time
[16:40:04] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.074 second response time
[18:22:16] (PS1) Mdann52: Enable new user groups on gu.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787)
[18:22:37] (CR) jenkins-bot: [V: -1] Enable new user groups on gu.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: Mdann52)
[19:07:54] (PS2) Mdann52: Enable new user groups on gu.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787)
[19:08:12] (CR) jenkins-bot: [V: -1] Enable new user groups on gu.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: Mdann52)
[19:08:39] (CR) Mdann52: "Sometimes, I hate commas....." [mediawiki-config] - https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: Mdann52)
[19:12:32] (PS3) Mdann52: Enable new user groups on gu.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787)
[20:32:07] (CR) Reedy: "I think it can be swatted. A seperate deployment window wouldn't really help. I just think it needs some monitoring post deployment (over " [mediawiki-config] - https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) (owner: Brian Wolff)
[20:33:44] (PS5) Ori.livneh: Increase $wgCopyUploadTimeout to 90 seconds (from default 25) [mediawiki-config] - https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) (owner: Brian Wolff)
[20:33:50] (CR) Ori.livneh: [C: 2] Increase $wgCopyUploadTimeout to 90 seconds (from default 25) [mediawiki-config] - https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) (owner: Brian Wolff)
[20:34:16] (Merged) jenkins-bot: Increase $wgCopyUploadTimeout to 90 seconds (from default 25) [mediawiki-config] - https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) (owner: Brian Wolff)
[20:35:00] (PS1) Reedy: Update Image Area to 100MP [mediawiki-config] - https://gerrit.wikimedia.org/r/255822
[20:35:20] !log ori@tin Synchronized wmf-config/InitialiseSettings.php: Ie33ae3b6a: Increase $wgCopyUploadTimeout to 90 seconds (from default 25) (T118887) (duration: 00m 27s)
[20:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:35:45] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:38:21] (CR) Reedy: "Various url-downloader timeouts are at https://github.com/wikimedia/operations-puppet/blob/812f280d16acfe3083259e8dfa7ce12ebf71da87/templa" [mediawiki-config] - https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) (owner: Brian Wolff)
[20:41:22] gj ori :D
[20:45:28] operations, Wikimedia-General-or-Unknown: localisationupdate broken on wmf wikis - https://phabricator.wikimedia.org/T119746#1836241 (Reedy)
[20:49:19] !log krenair@tin Synchronized php-1.27.0-wmf.7/cache/l10n: l10nupdate for 1.27.0-wmf.7 (duration: 07m 11s)
[20:49:19] !log l10nupdate@tin LocalisationUpdate failed: Failed to sync-dir 'php-1.27.0-wmf.7/cache/l10n'
[20:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:51:48] operations, Wikimedia-General-or-Unknown: localisationupdate broken on wmf wikis - https://phabricator.wikimedia.org/T119746#1836244 (Reedy) ``` fatal: You don't exist. Go away! ``` That's apparently a ssh related error
[20:54:52] Krenair: Found it
[20:55:18] operations, Wikimedia-General-or-Unknown: localisationupdate broken on wmf wikis - https://phabricator.wikimedia.org/T119746#1836253 (Reedy) ``` l10nupdate@tin:~$ git var GIT_AUTHOR_IDENT *** Please tell me who you are. Run git config --global user.email "you@example.com" git config --global user.n...
[20:55:52] I can fix it, but it's not puppetised
[20:55:57] And I don't know why it broke after the uid change
[20:57:06] thanks ori
[21:02:14] operations, Wikimedia-General-or-Unknown: localisationupdate broken on wmf wikis - https://phabricator.wikimedia.org/T119746#1836257 (Reedy) So this will fix it, but I don't know why it got broken. Was it by mutante changing the uid? How did it ever work before? Can we puppetise this somehow? ``` l10nupd...
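The "fatal: You don't exist. Go away!" failure quoted in T119746 above comes from git refusing to act without a configured identity, and the truncated paste shows the fix being applied: setting user.email and user.name for the l10nupdate user. A minimal, side-effect-free reproduction of that fix (HOME is pointed at a throwaway directory, and the identity values are illustrative assumptions, not the ones actually used on tin):

```shell
# Sketch of the git-identity fix discussed above. git refuses to create
# commits (and "git var GIT_AUTHOR_IDENT" errors out) until user.email
# and user.name are set. HOME is redirected to a temp dir so this does
# not touch a real ~/.gitconfig; the values below are assumptions.
export HOME="$(mktemp -d)"
cd "$HOME" && git init -q scratch && cd scratch
git config --global user.email "l10nupdate@wikimedia.org"
git config --global user.name "l10nupdate"
git var GIT_AUTHOR_IDENT   # now prints an identity instead of erroring
```

As bd808 notes below, puppetising this would just mean provisioning a ~l10nupdate/.gitconfig with the same two keys.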
[21:02:35] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:07:21] Reedy, it's not the only issue though
[21:07:33] No, but the other issues are scap related
[21:07:39] I'm not sure
[21:08:09] 20:42:46 ['/srv/deployment/scap/scap/bin/sync-master', 'tin.eqiad.wmnet'] on mira.codfw.wmnet returned [70]: 20:42:46 Copying to mira.codfw.wmnet from tin.eqiad.wmnet
[21:08:09] 20:42:46 Started rsync master
[21:08:09] sudo: a password is required
[21:08:09] 20:42:46 Finished rsync master (duration: 00m 00s)
[21:08:09] 20:42:46 Unhandled error:
[21:08:12] (stuff)
[21:08:19] 20:42:46 sync-master failed: Command '['sudo', '-n', '--', '/usr/local/bin/scap-master-sync', 'tin.eqiad.wmnet']' returned non-zero exit status 1
[21:08:32] "sudo: a password is required"?
[21:16:12] Doesn't the master co-sync sync as root to get all the permissions etc?
[21:16:31] I think you get similar if you run sync-file etc yourself... Or did at one point
[21:32:46] something in gerrit: Maybe it could be useful to add User Mdann52 to whitelist, so that jenkins-bot will make all tests. The changes looked good, so I guess this is not a problem, isn't it? Here an example: https://gerrit.wikimedia.org/r/#/c/255810/
[21:34:53] (CR) Luke081515: [C: -1] "In my opinion we should ask, if they think, that sysop can remove these rights too. The point is: It is not useful, that they need to cont" [mediawiki-config] - https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: Mdann52)
[21:36:33] He's still relatively new
[21:36:43] Whitelisting gives quite a lot of potential
[21:39:19] Krenair: uh, I fixed grrrit-wm yesterda
[21:39:20] y
[21:39:24] was a kubernetes DNS issue
[21:43:33] Reedy: hm, ok. I don't know the full potential, so...
[21:43:47] It was wider than it was
[21:43:54] But we narrowed the scope because there is the potential for abuse
[21:44:19] (PS4) Mdann52: Enable new user groups on gu.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787)
[21:48:22] (CR) Luke081515: [C: -1] "Sorry, if this was missunderstanding, but that was not a problem of your patch, this is a problem at the task at phabricator, this point i" [mediawiki-config] - https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: Mdann52)
[21:49:15] (CR) Mdann52: "Personally, this seems to be common sense - in any case, this isn't going to be merged until January, so plenty of time to sort it out" [mediawiki-config] - https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: Mdann52)
[22:21:35] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[22:27:55] operations, Wikimedia-General-or-Unknown: localisationupdate broken on wmf wikis - https://phabricator.wikimedia.org/T119746#1836299 (bd808) >>! In T119746#1836257, @Reedy wrote: > Can we puppetise this somehow? We would just need to provision a ~l10nupdate/.gitconfig file I think. No idea why this is su...
[22:28:33] bd808: The only change I know that has happened was mutante|away changning the uid for l10nupdate
[22:28:50] But there wasn't a .gitconfig there before
[22:29:51] *nod* I wou;dn't expect we suddenly have a new git version on tin or something
[22:30:26] operations, Wikimedia-General-or-Unknown: localisationupdate broken on wmf wikis - https://phabricator.wikimedia.org/T119746#1836300 (Reedy) >>! In T119746#1836299, @bd808 wrote: >>>! In T119746#1836257, @Reedy wrote: >> Can we puppetise this somehow? > > We would just need to provision a ~l10nupdate/.gi...
[22:32:05] * Reedy shrugs
[22:32:36] bd808: thanks for the replag tool!
[22:32:41] it's the curse of scap. everytime we try to make it better we make it strangely unstable for a few weeks
[22:32:45] yuvipanda: yw
[22:33:03] computers suck
[22:33:19] let's rewrite it in nodejs...
[22:33:39] * bd808 slaps yuvipanda with a large COBOL module
[22:33:40] Now yuvipanda, no need to be like that
[22:34:25] fine, fine :)
[22:34:28] is tin trusty yet?
[22:34:45] I don't think so
[22:34:57] no, but I think we are pretty close to being able to switch to mira and reimage tin
[22:35:05] DISTRIB_DESCRIPTION="Ubuntu 12.04.5 LTS"
[22:36:55] couldn't we switch to mira tomorrow?
[22:37:33] Presumably we could, just we'd swap the tin/mira issues around
[22:38:00] is mira trusty?
[22:38:34] well... I don't know about l10nupdate, trebuchet and we still don't have the uid for l10update pinned in puppet so that would have to be manually fixed on tin after reimaging
[22:38:58] yuvipanda: yus
[22:38:59] I think j.oe was going to tackle trebuchet next
[22:39:29] by introducing a service name for the deploy server and fixing all the git remotes to point at it
[22:39:55] bd808, l10nupdate is broken at the moment due to scap issues
[22:40:19] to enable it on mira you just change a hiera entry anyway
[22:40:25] besides the git thing that Reedy was poking at?
[22:40:35] I have no idea what trebuchet is even used for other than scap itself
[22:40:43] yeah, the return value of sync-dir is used too sensatively
[22:41:02] anything of value?
[22:41:10] lots of things (ieg, scholarships, parsoid, ocg, kibana, ...)
[22:41:29] graphoid
[22:41:43] pretty much everything except MW and restbase
[22:41:46] https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/files/l10nupdate-1#L108-L113
[22:41:54] if [[ $? -ne 0 ]]; then
[22:42:19] so sync-dir is failing as l10nupdate?
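The l10nupdate-1 lines Reedy links to check sync-dir's exit status, which is why the failing master co-sync aborts the whole nightly run even though the file sync itself mostly succeeds. A simplified, hypothetical sketch of that control flow (sync_dir here is a stand-in function, not the real scap command):

```shell
# Hypothetical sketch of the l10nupdate-1 error handling discussed
# above: a non-zero exit status from sync-dir stops the script before
# the CDB rebuild step, even when the sync reached most hosts.
sync_dir() { return 1; }   # simulate sync-dir failing via sync-master

if sync_dir "php-1.27.0-wmf.7/cache/l10n"; then
    result="synced; continuing to scap-rebuild-cdbs"
else
    result="Failed to sync-dir 'php-1.27.0-wmf.7/cache/l10n'"
fi
echo "$result"
```

This matches Reedy's observation below that the return value of sync-dir is used "too sensitively": one failing sub-step makes the whole run report failure and stop.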
I think it's "failing"
[22:43:12] At a glance, it seems there's some error number returned because of the master co-sync
[22:43:23] But the sync-dir everywhere else is fine
[22:43:25] So it just doesn't continue
[22:43:40] hmmm...
[22:43:43] we could just disable that error handling till the master cosync is fixed
[22:43:54] When I ran it manually, IIRC it did sync out everywhere
[22:44:07] It then ended
[22:44:13] Then run dsh -g mediawiki-installation -M -F 40 -- "sudo -u mwdeploy $SCAPDIR/scap-rebuild-cdbs" manually etc
[22:46:44] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[22:48:13] !log bd808@tin Synchronized php-1.27.0-wmf.5/cache/l10n: bd808 testing l10nupdate sync-dir using stale branch (duration: 01m 29s)
[22:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:48:19] ah ha
[22:48:38] missing sudoer stuff -- 22:47:56 sync-master failed: Command '['sudo', '-n', '--', '/usr/local/bin/scap-master-sync', 'tin.eqiad.wmnet']' returned non-zero exit status 1
[22:48:58] we made that wikidev only
[22:49:09] i am disappoint
[22:49:15] do we have an open bug for htis?
[22:49:23] Not specifically
[22:49:32] now I am disapoint
[22:49:34] Just the "localisationupdate broken on wmf wikis" as above
[22:50:14] Cause, technically, it's still broken
[22:50:26] Easy enough fix then, just needs a root?
[22:51:03] anything I can do?
[22:51:16] Mebbe
[22:51:25] Dunno if bd808 is going to make a patch
[22:51:27] * Reedy has a look
[22:51:31] ok!
[22:51:43] I'll be around on and off for the next hour or something at least, so let me know :)
[22:51:49] operations, Wikimedia-General-or-Unknown: localisationupdate broken on wmf wikis - https://phabricator.wikimedia.org/T119746#1836325 (bd808) ``` tin:/srv/mediawiki-staging (git master $) l10nupdate$ /usr/local/bin/sudo-withagent l10nupdate /srv/deployment/scap/scap/bin/sync-dir --no-shared-authsock -D ss...
[22:51:51] * yuvipanda makes foooooood
[22:53:00] https://github.com/wikimedia/operations-puppet/blob/5d74cf04a44fc1a3e35f5380371116005a93f614/modules/scap/manifests/master.pp#L52
[22:53:11] yeah
[22:53:13] bd808: ^^ I guess it's that line. Shall I make a patch?
[22:53:58] sure. dig around and figure out if we can grant to 2 different users with no common groups
[22:54:17] user => [ 'l10nupdate', 'mwdeploy', ],
[22:54:29] or we need to finally do something to make l10nupdate not be special (not use its own ssh key)
[22:54:29] Will that not work?
[22:55:13] dunno. I'll look at the puppet resource
[22:56:28] I don't think that will work. See https://github.com/wikimedia/operations-puppet/blob/5d74cf04a44fc1a3e35f5380371116005a93f614/modules/sudo/templates/sudoers.erb
[22:57:00] So we duplicate the block/and or parameterise the variable?
[22:58:15] the "right" way to do a sudoers rule to a list of arbitrary users is with a user_alias
[22:58:34] you make an alias with the list of users and then grant the rule to that list
[22:59:01] hmm
[22:59:33] which could be done in that erb template I suppose
[23:00:28] l10nupdate is suce a special snowflake. its very annoying
[23:01:19] Mmmm
[23:01:59] bd808: bribe Reedy with enough stroopwaffles and he might rewrite it
[23:03:16] the sudoers file would need "User_Alias scap_as_root = mwdeploy, l10nupdate" and the grant the rule to "scap_as_root" (or any other arbitrary name)
[23:03:44] p858snake: if it was that easy I would have done it 2 years ago ;)
[23:04:14] why is l10nupdate separate like this anyway?
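Rendered into an actual sudoers fragment, the User_Alias approach bd808 describes above would look roughly like this. The alias and command names follow the chat (sudoers requires alias names to be uppercase, hence SCAP_AS_ROOT rather than the chat's scap_as_root); the exact grant line is an assumption, since the real rule is generated from modules/sudo/templates/sudoers.erb in operations/puppet:

```
# Hypothetical sudoers fragment for the fix discussed above: one alias
# covering both users, granted the master-master sync command as root.
User_Alias SCAP_AS_ROOT = mwdeploy, l10nupdate
SCAP_AS_ROOT ALL = (root) NOPASSWD: /usr/local/bin/scap-master-sync
```

This is what lets `sudo -n` succeed for l10nupdate; without the alias, the rule was granted via the wikidev group only, producing the "sudo: a password is required" failure seen earlier.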
[23:04:15] <%- if @user_alias then -%> <%= @user_alias %><%- end -%>
[23:05:56] https://wikitech.wikimedia.org/wiki/LocalisationUpdate is out of date
[23:07:24] * Reedy deletes a block
[23:08:09] Krenair: the user is separate for privilege separation (can't mess with MW code, only l10n files)
[23:08:28] it also predates the shared ssh user for syncing
[23:08:44] which is really maybe the aprt we should kill
[23:08:47] *part
[23:09:09] Reedy and I left a comment about that when we changed it to use sync-dir
[23:09:40] it used to use dsh and duplicate a bunch of rsync logig
[23:10:10] * bd808 continues his tradition of shitty typing
[23:10:32] like, javascript / php? :)
[23:10:51] * bd808 sends yuvipanda to reform school
[23:11:06] this new snarkypanda, tsk
[23:12:17] Reedy: I'm wondering if the better fix wouldn't be to allow l10nupdate to run sync-dir as mwdeploy
[23:12:30] I think I've been staring at perl6, go and rust too much...
[23:12:47] * bd808 tries to think of the bad things that l10nupdate could do with sync-dir
[23:12:56] * yuvipanda snobs harder
[23:13:22] * bd808 pats yuvipanda and tells him that his desire for strong typing will come and go as he grows up
[23:14:18] * yuvipanda agrees, thinks back to his Java (and C# and VB days)
[23:14:25] <_joe_> bd808: there is loose typing, and then there is "hi" == 0 and ===
[23:14:30] <_joe_> or well, javascript
[23:14:31] when I came back to php+python from 7 years in java duck typing felt scary and wrong
[23:14:53] * yuvipanda is happy with python's typing
[23:14:57] <_joe_> I mean python has a decent type system, if you don't count utf-8 strings
[23:15:05] <_joe_> which they got wrong twice
[23:15:12] everybody got utf-8 wrong
[23:15:19] <_joe_> ruby's typing is nice
[23:15:22] ok, we should let bd808 and Reedy fix things instead of sidetracking into language waaaaaaaaaaars
[23:15:28] <_joe_> ahahah
[23:15:29] <_joe_> yeah
[23:15:33] <_joe_> what's broken?
[23:15:41] the world!
[23:15:57] <_joe_> oh I thought something more nerdously interesting
[23:16:01] the nightly l10nupdate sync is failing because of things we did to scap
[23:16:45] l10nupdate runs sync-dir as itself rather than as the shared mwdeploy user. the l10nupdate user is missing rights for the master-master sync
[23:17:17] valhallasw`cloud: so PAWS now has resource limits - guaranteed 128M, upto 1G per user. also some CPU quotas
[23:17:19] now to test...
[23:17:20] so we need to either grant l10nupdate those rights or undo the decision to keep it syncing as a distinct user
[23:18:42] Reedy: do you remember if I was just being lazy in not making a new scap subcommand to do the l10n files sync and cdb rebuild?
[23:18:52] <_joe_> bd808: it's past midnight and I opened a new bottle of whisky tonight. I'm not inebriated, but too sleepy to help :/
[23:19:16] _joe_: its the middle of a 4 day weekend for me. I'm not even here ;)
[23:19:24] <_joe_> ahah ok
[23:19:25] bd808: I think you were just busy, and I was improving the situation by minimal amounts of work
[23:20:10] <_joe_> bd808: what is the reason for that script to run as a separate user?
[23:20:10] Reedy: sounds right. I think you got to all of this after I was supposed to be doing something else instead of scap
[23:21:33] _joe_: originally privilege separation. It was doing automated updates of the MW hsots before we introduced the shared ssh user/key
[23:22:12] it is still a good idea today that the nightly script can't mess with the php code that the wikis run
[23:23:20] Reedy: if we made a new scap command for the sync + cdb build then we could grant l10nupdate the right to sudo that command as mwdeploy
[23:23:41] Mmm
[23:23:44] that should mostly isolate the damage that l10nupdate could do
[23:23:47] There might be a task for that
[23:23:54] and kill the key and the dsh usage
[23:25:55] <_joe_> bd808: I don't see why we can't run the sync as mwdeploy
[23:26:44] It should still be rewritten into scap though
[23:27:54] _joe_: I think I agree. even if we let it sync anything from mediawiki-staging it can't do more than push code someone else has already staged there
[23:28:25] but we should do it with a scap script that consolidates the two parts and lets us try and validate what is happening too
[23:28:45] I can't actually find a task for converting l10nupdate to scap
[23:28:48] I'm sure there was one
[23:29:05] we probably closed it when sync-dir worked
[23:29:27] https://phabricator.wikimedia.org/T72443
[23:29:50] *nod*
[23:29:50] Mmm
[23:29:56] Just open a new one do you think?
[23:30:17] or jsut use T119746
[23:33:25] operations, Wikimedia-General-or-Unknown: localisationupdate broken on wmf wikis - https://phabricator.wikimedia.org/T119746#1836350 (bd808) >>! In T119746#1836325, @bd808 wrote: > The curse of l10nupdate strikes again. When we granted the sudoer rights to run the master-master sync as root we only grante...