[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150723T0000). [00:00:56] (03PS1) 10Alex Monk: Fix Ia02c239a: Use NS IDs instead of not-necessarily-defined constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226451 [00:01:14] (03CR) 10Alex Monk: [C: 032] Fix Ia02c239a: Use NS IDs instead of not-necessarily-defined constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226451 (owner: 10Alex Monk) [00:01:22] (03Merged) 10jenkins-bot: Fix Ia02c239a: Use NS IDs instead of not-necessarily-defined constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226451 (owner: 10Alex Monk) [00:02:22] !log krenair Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/225541/ (duration: 00m 12s) [00:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:02:47] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/225541/ (duration: 00m 13s) [00:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:04:06] twentyafterfour, are you about to take down phab? [00:05:47] woah wtf is this [00:07:04] 2015-07-23 00:06:38 mw1153 commonswiki exception ERROR: [d68c4bb1] /w/thumb_handler.php/2/2e/Mirage_III_A_01_Mus%0Aee_du_Bourget_P1020118.JPG/424px-%0AMirage_III_A_01_Musee_du_Bourget_P1020118.JPG MWException from line 171 of /srv/mediawiki/php-1.26wmf15/includes/objectcache/ObjectCache.php: CACHE_ACCEL requested but no suitable object cache is present. You may want to install APC. {"exception":"[object] (MWException(code: 0): CACHE_ACCEL [00:07:04] requested but no suitable object cache is present. You may want to install APC. 
at /srv/mediawiki/php-1.26wmf15/includes/objectcache/ObjectCache.php:171)"} [00:07:45] ori, gilles ^ [00:09:35] mw1153 is an hhvm image scaler [00:10:11] 6operations, 10Wikimedia-Site-requests, 5Patch-For-Review: Extension RSS fails to connect to feeds - https://phabricator.wikimedia.org/T90513#1473635 (10Krenair) 5Open>3Resolved a:3Krenair Should be better now. Please open separate tasks for specific URLs which fail due to redirects. [00:12:14] bd808, is that one of the newly reimaged ones? [00:12:26] logging in gave me a scary host key change warning [00:12:27] uptime says 14 days [00:12:37] that'd probably be why [00:12:47] PROBLEM - puppet last run on cp3041 is CRITICAL Puppet has 1 failures [00:13:35] bd808, do those imagescalers normally serve normal user traffic? [00:14:09] I think that's the exception I generated by browsing to the url provided in a previous one.... [00:14:54] image scalers would get thumb requests normally, yes [00:15:18] if there wasn't a varnish/swift hit for the image [00:15:26] okay [00:15:44] I'm going to dump this in the task about converting the eqiad imagescalers [00:15:54] apc should be built into hhvm [00:16:01] which is the weird part there [00:16:54] thumbnail works fine now: https://upload.wikimedia.org/wikipedia/commons/thumb/2/2e/Mirage_III_A_01_Musee_du_Bourget_P1020118.JPG/424px-Mirage_III_A_01_Musee_du_Bourget_P1020118.JPG [00:17:40] do file a bug or add that to the imagescaler hhvm bug [00:19:17] 7Blocked-on-Operations, 6operations, 6Commons, 6Multimedia, and 5 others: Convert eqiad imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1473663 (10Krenair) ```2015-07-23 00:06:38 mw1153 commonswiki exception ERROR: [d68c4bb1] /w/thumb_handler.php/2/2e/Mirage_III_A_01_Mus%0Aee_du_Bou... 
[00:19:18] https://phabricator.wikimedia.org/T84842#1473663 [00:20:31] Jul 23 00:20:22 mw1173: message repeated 18795 times: [ #012Notice: Undefined variable: wmgVisualEditorAutoAccountEnable in /srv/mediawiki/wmf-config/CommonSettings.php on line 2037] [00:20:34] This one won't go away [00:20:37] James_F, ^ [00:20:41] Argh. [00:21:08] 6operations, 6Multimedia, 6Performance-Team, 10Wikimedia-Site-requests, and 2 others: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1473675 (10Gilles) 5Open>3Resolved I think that the main request has been addressed. Cleaning up the existing... [00:21:11] But it makes no sense. [00:21:16] This was in InitialiseSettings [00:21:44] Krenair: Did you sync InitSettings? [00:21:48] Krenair: Or just Common? [00:21:57] (Stupid question, but…) [00:22:07] it has a '$' in InitialiseSettings.php [00:22:15] oh. [00:22:17] '$wmgVisualEditorAutoAccountEnable' => array() [00:22:25] sneaky [00:22:27] hah. okay, no wonder that makes it complain [00:22:35] Ha. [00:22:39] I'm an idiot. [00:22:43] Sorry, Krenair. [00:23:12] I should have caught that :P [00:23:16] !log krenair Synchronized wmf-config/InitialiseSettings.php: fix extra dollar mark in https://gerrit.wikimedia.org/r/#/c/226336/1/wmf-config/InitialiseSettings.php (duration: 00m 12s) [00:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:23:54] looks good now [00:23:59] We should really have some unit tests for stuff like that in wmf-config [00:24:11] * James_F nods. [00:24:14] I would be willing to bet money this has happened before [00:24:24] Probably by my hand, no less. :-( [00:24:37] or ... we should let legoktm write a new config system for us ;) [00:25:07] * James_F grins. 
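(Editor's sketch, not part of the channel log.) The exchange above ends with the suggestion that wmf-config should have tests for mistakes like the stray '$' in an InitialiseSettings.php key. One possible shape for such a check is below. It is a minimal illustration only: the function name find_dollar_keys, the regular expression, and the sample settings fragment (including its 'default' values) are assumptions made for this example and are not taken from the mediawiki-config repository.

```python
import re

# Matches a quoted array key that starts with a dollar sign, e.g.
#     '$wmgSomething' => array(
KEY_WITH_DOLLAR = re.compile(r"""^\s*(['"])(\$\w+)\1\s*=>""")


def find_dollar_keys(php_source):
    """Return (line_number, key) pairs for quoted keys that start with '$'."""
    hits = []
    for number, line in enumerate(php_source.splitlines(), start=1):
        match = KEY_WITH_DOLLAR.match(line)
        if match:
            hits.append((number, match.group(2)))
    return hits


# Hypothetical fragment shaped like the settings array discussed in the log.
SAMPLE = """\
'wmgUseVisualEditor' => array(
    'default' => false,
),
'$wmgVisualEditorAutoAccountEnable' => array(
    'default' => false,
),
"""

for number, key in find_dollar_keys(SAMPLE):
    print(f"line {number}: suspicious key {key!r}")
```

A line-based regular expression like this only catches keys that begin a line; a real test would more likely load the settings array in PHP and inspect its keys.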
[00:25:09] At least I bother to open 3 separate sessions so I can monitor these logs :) [00:26:53] logs coming in now are all for old occurrences [00:27:38] amazing how a simple extra character in one of these files can lead to hundreds of thousands/millions of log entries being generated [00:28:14] oh, apart from this one server: [00:28:15] Jul 23 00:24:57 mw1145: message repeated 2079 times: [ #012Notice: Undefined variable: wmgVisualEditorAutoAccountEnable in /srv/mediawiki/wmf-config/CommonSettings.php on line 2037] [00:28:16] Jul 23 00:24:57 mw1145: #012Notice: Undefined variable: wmgVisualEditorAutoAccountEnable in /srv/mediawiki/wmf-config/CommonSettings.php on line 2037 [00:28:32] which apparently didn't get the memo. [00:29:20] although the code there looks updated [00:29:38] Hmm. [00:29:40] Cached? [00:29:45] might be rsyslog buffering on you. give it a bit [00:30:04] same for mw1189 which just reported over 100,000 instances [00:30:16] yeah [00:31:17] Krenair: I am about to take down phab, yes [00:31:38] if there isn't any objection? 
[00:31:55] would've been a few minutes ago, fine with me now [00:34:14] (03PS1) 10Alex Monk: Fix Iff676395: Remove extra dollar mark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226453 [00:34:29] (03CR) 10Alex Monk: [C: 032] "Already in prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226453 (owner: 10Alex Monk) [00:34:35] (03Merged) 10jenkins-bot: Fix Iff676395: Remove extra dollar mark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226453 (owner: 10Alex Monk) [00:37:09] Krenair: with extension registration, you no longer need the $wg = $wmg hack [00:38:48] RECOVERY - puppet last run on cp3041 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [00:44:31] actually, I guess there's more we can do [00:45:12] (03CR) 10Alex Monk: [C: 032] Fix fdcwiki's wgMetaNamespace to not be Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225515 (https://phabricator.wikimedia.org/T106188) (owner: 10Matanya) [00:45:48] (03Merged) 10jenkins-bot: Fix fdcwiki's wgMetaNamespace to not be Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225515 (https://phabricator.wikimedia.org/T106188) (owner: 10Matanya) [00:46:22] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/225515/ (duration: 00m 12s) [00:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:48:45] (03CR) 10Alex Monk: [C: 04-1] "We need to make it merge in $wgVisualEditorNamespaces" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226414 (owner: 10Alex Monk) [00:49:49] (03CR) 10Alex Monk: "Do you want to just do this at some point, Chris? It's not clear who would really need to approve it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) (owner: 10Nemo bis) [00:50:21] !log deployed kartotherian fix, still not starting as a service, and no idea why. Have no access to logs. Frustrated. 
[00:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:54:28] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [01:00:53] phab down? [01:01:01] Request: GET http://phabricator.wikimedia.org/T45250, from 10.64.0.172 via cp1044 cp1044 ([10.64.0.172]:80), Varnish XID 764221907 [01:01:01] Forwarded for: 99.186.41.108, 10.64.0.172 [01:01:01] Error: 503, Service Unavailable at Thu, 23 Jul 2015 01:00:55 GMT [01:01:25] twentyafterfour: are you doing something? [01:01:36] yes [01:01:39] he's upgrading phab [01:01:40] it's scheduled [01:01:50] !log twentyafterfour is upgrading phabricator [01:01:55] twentyafterfour: please log it [01:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:03:57] !log ori Synchronized php-1.26wmf14/includes/libs/objectcache/APCBagOStuff.php: I4b2cf1715 (duration: 00m 12s) [01:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:04:59] !log phab is back [01:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:07:16] legoktm: sorry, I'll try to remember the log next time. I did silence icinga, at least [01:07:22] thanks :) [01:08:37] (03PS1) 10Tim Landscheidt: Labs: Fix puppetmaster::certcleaner for self-hosted puppet masters [puppet] - 10https://gerrit.wikimedia.org/r/226455 (https://phabricator.wikimedia.org/T106627) [01:09:44] So I think that I want to move the configuration of phabricator tags into etcd, is there a way, currently, to read & write values in etcd from tin? [01:10:14] twentyafterfour: yes. You can use HTTP PUT and GET requests [01:10:37] cool [01:10:59] e.g. curl -k -L https://etcd1001.eqiad.wmnet:2379/v2/keys/conftool?recursive=true [01:11:02] I found documentation for conftool but not much else, and it seems conftool is palladium only [01:11:35] probably best to ask on the ops list. 
I'm sure if you have to use conftool or not [01:12:04] *I'm not sure [01:20:24] 7Puppet, 6Labs: Puppet Trebuchet provider compares refname with commit sha1 and does NOT refresh the git repo! - https://phabricator.wikimedia.org/T77002#1473832 (10ori) @Hashar is this still an issue? [01:25:45] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1473834 (10Yurik) 3NEW [01:26:13] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1473842 (10Yurik) Relates to T106637 [01:27:38] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 705.514208089 [01:29:53] any ops can help? The kartotherian service is in a service restart loop on all maps-test200{1-4}.codfw.wmnet, and i can't get to the logs [01:30:33] yurik: where are the logs? [01:31:22] ori, i am guessing it is syslog [01:31:38] because the service's own logs are empty (in /var/logs/kartotherian) [01:34:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874224 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 311048 MB (10% inode=99%) [01:34:49] there's a lot of: [01:34:51] Jul 23 01:33:19 maps-test2001 systemd[14971]: Failed at step USER spawning /usr/bin/nodejs: No such process [01:34:51] Jul 23 01:33:19 maps-test2001 systemd[1]: kartotherian.service: main process exited, code=exited, status=217/USER [01:34:51] Jul 23 01:33:19 maps-test2001 systemd[1]: Unit kartotherian.service entered failed state. 
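(Editor's sketch, not part of the channel log.) In systemd, exit status 217/USER means the service manager failed to determine or change user credentials for the unit, which commonly points at a User= setting naming an account that does not exist on the host. Whether that is the cause here is not established in the log. The sketch below shows one way to check it, under stated assumptions: the unit text is a hypothetical example rather than the real kartotherian.service, and the helper names unit_user and user_exists are invented for illustration.

```python
import pwd
import re


def unit_user(unit_text):
    """Return the value of the last User= line in a unit file, or None."""
    user = None
    for line in unit_text.splitlines():
        match = re.match(r"\s*User\s*=\s*(\S+)", line)
        if match:
            user = match.group(1)
    return user


def user_exists(name):
    """Return True if the local passwd database knows this account name."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        return False


# Hypothetical unit file contents, for illustration only.
SAMPLE_UNIT = """\
[Service]
User=kartotherian
ExecStart=/usr/bin/nodejs server.js
Restart=always
"""

name = unit_user(SAMPLE_UNIT)
if name is not None and not user_exists(name):
    print(f"unit runs as {name!r}, but no such account exists on this host")
```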
[01:37:06] (03PS1) 10Tim Landscheidt: cassandra: Fix strict puppet-lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/226459 (https://phabricator.wikimedia.org/T87132) [01:37:56] yurik: /var/log/karthotherian is not empty on maps-test2001 [01:38:07] ori, that's old stuff [01:38:15] see dates [01:39:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874224 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 305759 MB (10% inode=99%) [01:40:11] (03CR) 10Tim Landscheidt: "I was not sure if "${var}" did some magic integer -> string conversion, so I tested it with:" [puppet] - 10https://gerrit.wikimedia.org/r/226459 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [01:44:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874224 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 306437 MB (10% inode=99%) [01:44:57] !log ori Synchronized php-1.26wmf14/includes/libs/objectcache/APCBagOStuff.php: I4b2cf1715538 (duration: 00m 12s) [01:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:45:14] !log ori Synchronized php-1.26wmf15/includes/libs/objectcache/APCBagOStuff.php: I4b2cf1715538 (duration: 00m 12s) [01:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:49:20] ori, do you see anything in the /var/log/syslog? [01:49:34] or maybe daemon? 
[01:49:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874224 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300293 MB (10% inode=99%) [01:54:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874224 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 304389 MB (10% inode=99%) [01:58:38] (03PS1) 10Springle: depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226461 [01:59:03] (03CR) 10Springle: [C: 032] depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226461 (owner: 10Springle) [01:59:09] (03Merged) 10jenkins-bot: depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226461 (owner: 10Springle) [01:59:47] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874224 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 300293 MB (10% inode=99%) [02:00:37] 'zh-min-nanwikisource' => 'Wiki T��-su-k�an', [02:00:37] 'zh-min-nanwikisource' => 'Wiki_T��-su-k�an', [02:00:44] from wmf-config/InitialiseSettings.php [02:01:24] wgSitename and wgMetaNamespace [02:02:09] (03PS1) 10Tim Landscheidt: statsdlb: Fix strict puppet-lint check [puppet] - 10https://gerrit.wikimedia.org/r/226463 (https://phabricator.wikimedia.org/T87132) [02:03:03] !log LocalisationUpdate failed (1.26wmf14) at 2015-07-23 02:03:02+00:00 [02:03:03] !log LocalisationUpdate failed (1.26wmf15) at 2015-07-23 02:03:03+00:00 [02:03:07] ^015f5b7 (Catrope 2012-02-24 17:16:16 -0800 1907) 'zh-min-nanwikisource' => 'Wiki T��-su-k�an', [02:03:07] 286e096e (Dereckson 2012-10-19 00:48:05 +0200 2386) 'zh-min-nanwikisource' => 'Wiki_T��-su-k�an', [02:03:10] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:04:23] if you ignore 286e096e which was just changing a space to _, that second one also points back to the SVN import [02:04:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874224 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [02:05:00] !log springle Synchronized wmf-config/db-eqiad.php: depool db1070 (duration: 00m 12s) [02:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:06:50] (03CR) 10Tim Landscheidt: "I first tested "validate_re($backend_ports, '^\d+(\s\d+)*$', […]" (i. e., "${var}" => $var), but that failed the validation. I tested the" [puppet] - 10https://gerrit.wikimedia.org/r/226463 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [02:07:12] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jul 23 02:07:12 UTC 2015 (duration 7m 11s) [02:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:09:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874224 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [02:14:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874224 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [02:15:42] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Request Elasticsearch hardware for secondary CirrusSearch in codfw - 
https://phabricator.wikimedia.org/T105707#1473886 (10Manybubbles) I believe it was semi arbitrary and based on budget. We have 15 nice machines in the eqiad cluster and 16... [02:19:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874224 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [02:24:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874224 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [02:29:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874224 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [02:34:07] !log l10nupdate Synchronized php-1.26wmf14/cache/l10n: (no message) (duration: 07m 13s) [02:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:34:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874224 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [02:37:10] ACKNOWLEDGEMENT - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874224 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%): Sean Pringle Seems to be a problem, but not an imminent explosion? Pausing #ops icinga noise. - The acknowledgement expires at: 2015-07-23 08:35:02. 
[02:37:55] !log LocalisationUpdate completed (1.26wmf14) at 2015-07-23 02:37:55+00:00 [02:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:39:51] !log temporarily silenced backup4001 check_disk space icinga noise; seems important, but not exploding-any-minute-now [02:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:51:15] (03PS2) 10Alex Monk: Get rid of default=wikipedia assumptions in config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225287 (https://phabricator.wikimedia.org/T104088) [02:52:08] (03PS3) 10Alex Monk: Get rid of default=wikipedia assumptions in config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225287 (https://phabricator.wikimedia.org/T104088) [02:55:08] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61479 bytes in 0.116 second response time [03:00:48] !log l10nupdate Synchronized php-1.26wmf15/cache/l10n: (no message) (duration: 07m 24s) [03:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:02:08] interesting [03:02:20] I have found references to diqwiktionary, bbdwikimedia, yiwikinews, and liwikinews [03:02:35] none of which have any other records [03:04:48] !log LocalisationUpdate completed (1.26wmf15) at 2015-07-23 03:04:48+00:00 [03:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:07:58] (03PS1) 10Alex Monk: Remove reference to nonexistent ru_sibwiki.png [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226469 [03:29:28] (03PS1) 10Alex Monk: Remove/fix references to non-existent wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226470 [03:31:13] 6operations, 7Easy: Update people.wikimedia.org with the 2015 Wikimania group photo - https://phabricator.wikimedia.org/T106598#1473916 (10Danny_B) >>! 
In T106598#1473096, @Krenair wrote: > I'd use https://commons.wikimedia.org/wiki/File:Wikimania_2015_%E2%80%93_Hackathon_group_photo.jpg this year actually, si... [03:32:34] Sigh, even more nonsense [03:33:01] ilwikimedia, noboard_chapterswikimedia and arbcom_dewiki's wgLogo paths do not exist [03:37:53] long term static asset cache is 30 days, right? [03:42:20] 365 days [03:43:24] (03PS1) 10Alex Monk: Set ilwikimedia, noboard_chapterswikimedia and arbcom_dewiki's logos to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226471 [03:52:31] 6operations, 6Analytics-Backlog, 6Release-Engineering, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1473921 (10greg) Sorted differently (branch order): ``` 1.26wmf1 3 1.26wmf4 8 1.26wmf5 1 1.26wmf7 5 1.26wmf8 1 1.26wmf10 2 1... [04:04:11] !log upgrade & reboot db1070 [04:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:24:44] !log ori Synchronized php-1.26wmf14/extensions/Scribunto/common/Base.php: (no message) (duration: 00m 12s) [04:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:25:07] !log ori Synchronized php-1.26wmf15/extensions/Scribunto/common/Base.php: (no message) (duration: 00m 13s) [04:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:40:47] PROBLEM - Disk space on ms-be1005 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdj1 is not accessible: Input/output error [04:41:17] PROBLEM - RAID on ms-be1005 is CRITICAL 1 failed LD(s) (Offline) [05:00:16] 6operations, 6Multimedia, 7Performance: Choose a sensible set of thumbnail sizes for Special:Preferences - https://phabricator.wikimedia.org/T106640#1473929 (10Whatamidoing-WMF) 3NEW [05:01:01] 6operations, 6Multimedia, 10Wikimedia-Site-requests, 7Performance: Choose a sensible set of thumbnail sizes for Special:Preferences - 
https://phabricator.wikimedia.org/T106640#1473941 (10Glaisher) [05:01:04] 6operations, 6Multimedia, 6Performance-Team, 10Wikimedia-Site-requests, and 2 others: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1473939 (10Whatamidoing-WMF) Thank you, @Gilles. I have created {T106640} for the next step in cleaning up the l... [05:02:58] 6operations, 6Multimedia, 10Wikimedia-Site-requests, 7Performance: Choose a sensible set of thumbnail sizes for Special:Preferences - https://phabricator.wikimedia.org/T106640#1473942 (10Whatamidoing-WMF) >>! In T65440#1445230, @Edokter wrote: > I think we should build on multiples of 120 and 160. That sho... [05:03:38] PROBLEM - puppet last run on ms-be1005 is CRITICAL Puppet has 1 failures [05:45:38] 6operations, 6Analytics-Backlog, 6Release-Engineering, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1473986 (10mmodell) I think it's clear that we need to abandon the practice of branching & changing URL prefixes each week. [05:47:01] 6operations, 6Analytics-Backlog, 6Release-Engineering, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1473987 (10mmodell) One naive solution would be to replace old branches with symlinks to a current branch. This would mostly s... 
[06:06:31] 6operations, 7Database: mariadb multi-source replication glitch with site_identifiers - https://phabricator.wikimedia.org/T106647#1474014 (10Springle) 3NEW a:3Springle [06:13:08] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]BR [06:24:38] (03PS1) 10Glaisher: Set $wgCategoryCollation to 'uca-default' on cswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226483 (https://phabricator.wikimedia.org/T106337) [06:29:58] PROBLEM - puppet last run on cp1053 is CRITICAL Puppet has 2 failures [06:31:10] 6operations, 7Database: mariadb multi-source replication glitch with site_identifiers - https://phabricator.wikimedia.org/T106647#1474052 (10jcrespo) > dbstore s7 has showed strange problems recently (T104471) That was not a "strange problem", that was a misconfiguration error due to wrong filtering. 
[06:31:58] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 2 failures [06:32:08] PROBLEM - puppet last run on mw1215 is CRITICAL Puppet has 1 failures [06:32:38] PROBLEM - puppet last run on mw2018 is CRITICAL Puppet has 1 failures [06:32:47] PROBLEM - puppet last run on mw1119 is CRITICAL Puppet has 2 failures [06:32:48] PROBLEM - puppet last run on mw1110 is CRITICAL Puppet has 1 failures [06:32:49] PROBLEM - puppet last run on db2060 is CRITICAL Puppet has 1 failures [06:33:08] PROBLEM - puppet last run on mw2129 is CRITICAL Puppet has 1 failures [06:33:08] PROBLEM - puppet last run on mw2016 is CRITICAL Puppet has 2 failures [06:33:08] PROBLEM - puppet last run on mw2036 is CRITICAL Puppet has 1 failures [06:33:28] PROBLEM - puppet last run on mw2158 is CRITICAL Puppet has 2 failures [06:33:28] PROBLEM - puppet last run on mw2023 is CRITICAL Puppet has 3 failures [06:39:06] (03PS1) 10Springle: repool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226484 [06:40:55] (03CR) 10Springle: [C: 032] repool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226484 (owner: 10Springle) [06:41:02] (03Merged) 10jenkins-bot: repool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226484 (owner: 10Springle) [06:42:26] !log springle Synchronized wmf-config/db-eqiad.php: repool db1070, warm up (duration: 00m 13s) [06:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:43:18] 6operations, 7Database: mariadb multi-source replication glitch with site_identifiers - https://phabricator.wikimedia.org/T106647#1474059 (10jcrespo) Also, dbstore100[12] issues where different, where a `stop slave; start slave;` fixed replication because false duplicate key errors. 
[06:46:28] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1474061 (10mmodell) >>! In T99096#1458885, @Krinkle wrote: > If that seemed possible I would've done... [06:51:18] 6operations, 10Deployment-Systems, 7HHVM: HHVM lock-ups - https://phabricator.wikimedia.org/T89912#1474062 (10Joe) No, this specific type of locking has gone away with HHVM 3.6.x; Closing the ticket. [06:51:25] 6operations, 10Deployment-Systems, 7HHVM: HHVM lock-ups - https://phabricator.wikimedia.org/T89912#1474063 (10Joe) 5Open>3Resolved [06:56:18] RECOVERY - puppet last run on mw1119 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:56:38] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:56:47] RECOVERY - puppet last run on mw2036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:38] RECOVERY - puppet last run on cp1053 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:28] RECOVERY - puppet last run on db2060 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:58:38] RECOVERY - puppet last run on mw2129 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:07] RECOVERY - puppet last run on mw2158 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:59:07] RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:37] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:59:38] RECOVERY - puppet last run on mw1215 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:44] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jul 23 06:59:44 UTC 
2015 (duration 59m 43s) [06:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:00:18] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:27] RECOVERY - puppet last run on mw1110 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:03:14] (03PS1) 10Giuseppe Lavagetto: admin: grant access to stat1003 and eventlogging to legoktm [puppet] - 10https://gerrit.wikimedia.org/r/226486 (https://phabricator.wikimedia.org/T106184) [07:03:56] 6operations, 7Database: mariadb multi-source replication glitch with site_identifiers - https://phabricator.wikimedia.org/T106647#1474075 (10jcrespo) Possibilities: * If the delete happened at the time of the previous import, it could be just a re-population error * If the delete happened just before, maybe... [07:04:47] RECOVERY - Router interfaces on cr1-eqiad is OK host 208.80.154.196, interfaces up: 230, down: 0, dormant: 0, excluded: 0, unused: 0 [07:09:20] (03PS2) 10Giuseppe Lavagetto: admin: grant access to stat1003 and eventlogging to legoktm [puppet] - 10https://gerrit.wikimedia.org/r/226486 (https://phabricator.wikimedia.org/T106184) [07:10:34] (03CR) 10Giuseppe Lavagetto: [C: 032] admin: grant access to stat1003 and eventlogging to legoktm [puppet] - 10https://gerrit.wikimedia.org/r/226486 (https://phabricator.wikimedia.org/T106184) (owner: 10Giuseppe Lavagetto) [07:13:39] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003 and eventlogging for legoktm - https://phabricator.wikimedia.org/T106184#1474081 (10Joe) 5Open>3Resolved [07:13:50] _joe_: thanks :) [07:13:51] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003 and eventlogging for legoktm - https://phabricator.wikimedia.org/T106184#1461337 (10Joe) a:3Joe [07:13:59] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003 and eventlogging for legoktm 
- https://phabricator.wikimedia.org/T106184#1461337 (10Joe) [07:14:01] 10Ops-Access-Reviews, 6operations: Review access to stat1003, eventlogging for legoktm - https://phabricator.wikimedia.org/T106315#1474083 (10Joe) 5Open>3Resolved [07:14:37] 6operations, 10MediaWiki-ResourceLoader, 7HHVM, 5MW-1.26-release, and 2 others: HHVM memory leaks result in OOMs & 500 spikes - https://phabricator.wikimedia.org/T104769#1474085 (10Joe) 5Open>3Resolved [07:28:39] <_joe_> !log upgrading hhvm on the canary appservers [07:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:30:13] (03PS2) 10Muehlenhoff: Remove firejail conditional [puppet] - 10https://gerrit.wikimedia.org/r/226273 (https://phabricator.wikimedia.org/T101870) [07:30:30] (03CR) 10Muehlenhoff: [C: 032 V: 032] Remove firejail conditional [puppet] - 10https://gerrit.wikimedia.org/r/226273 (https://phabricator.wikimedia.org/T101870) (owner: 10Muehlenhoff) [07:35:10] 6operations, 6Services, 5Patch-For-Review: Service containment for nodejs-based services with firejail - https://phabricator.wikimedia.org/T101870#1474119 (10MoritzMuehlenhoff) 5Open>3Resolved firejail is now enabled by default for service::node [07:36:50] (03PS1) 10Matanya: access: New production ssh key for awight [puppet] - 10https://gerrit.wikimedia.org/r/226488 [07:53:11] (03PS1) 10Matanya: access: shell account for Srijan Kumar [puppet] - 10https://gerrit.wikimedia.org/r/226491 [07:54:49] (03PS1) 10Matanya: access: grant srijan access to stat1003 via research group [puppet] - 10https://gerrit.wikimedia.org/r/226492 [07:55:21] (03CR) 10Matanya: "depends on https://gerrit.wikimedia.org/r/226491" [puppet] - 10https://gerrit.wikimedia.org/r/226492 (owner: 10Matanya) [07:55:38] (03CR) 10jenkins-bot: [V: 04-1] access: grant srijan access to stat1003 via research group [puppet] - 10https://gerrit.wikimedia.org/r/226492 (owner: 10Matanya) [07:55:57] 6operations, 7HHVM, 5Patch-For-Review: Custom session handler 
corrupted by session_destroy, "Failed to initialize storage module" - https://phabricator.wikimedia.org/T97675#1474137 (10Joe) Canary appservers updated. [08:04:17] _joe_: have you upgraded HHVM on the beta cluster as well ? [08:04:37] _joe_: I can handle it if you want [08:14:26] (03PS1) 10Giuseppe Lavagetto: service::checker: add support for other HTTP verbs [puppet] - 10https://gerrit.wikimedia.org/r/226497 [08:14:54] <_joe_> hashar: yes, did yesterday [08:14:57] <_joe_> (and logged it) [08:15:04] <_joe_> mobrovac: ^^ [08:15:19] (03CR) 10jenkins-bot: [V: 04-1] service::checker: add support for other HTTP verbs [puppet] - 10https://gerrit.wikimedia.org/r/226497 (owner: 10Giuseppe Lavagetto) [08:15:45] _joe_: awesome! [08:35:42] 6operations: Update libzmq3/pyzmq - https://phabricator.wikimedia.org/T106093#1474174 (10MoritzMuehlenhoff) zeromq 4.0.5 has been backported to precise/trusty pyzmq 14.4 has been backported to trusty and 14.0 to precise (no 14.4 due to a lack of dh-python in precise) Tests went fine in my labs setup with Salt... [08:39:47] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [08:42:55] oh [08:44:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [08:48:38] 6operations: Update libzmq3/pyzmq - https://phabricator.wikimedia.org/T106093#1474192 (10hashar) Reapplying the cherry-picked changes: https://gerrit.wikimedia.org/r/#q,Ib206f3820e5aa11ff6d26e777609b5692f74dd4f,n,z https://gerrit.wikimedia.org/r/#q,I5c0a1fd283cc545ab073a514e933a87d5fe23996,n,z https://gerrit.wi... 
[08:49:47] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [08:50:33] hashar:^ I guess you mean to add this to a different bug [08:50:55] moritzm: the back up issue ? [08:51:47] oh yeah sorry [08:54:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [08:55:10] for some reason I can't ssh to backup4001, trying console [08:55:18] (03PS8) 10Hashar: beta: Add script from Jenkins beta-update-databases [puppet] - 10https://gerrit.wikimedia.org/r/210618 (https://phabricator.wikimedia.org/T96199) (owner: 10Thcipriani) [08:59:24] hashar: no, you pasted the cherrypicked commits to the zeromq/pyzmq bug, they probably belong somewhere else [08:59:45] moritzm: yup fixed, sorry. I was lurking at that task [08:59:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [08:59:54] nodepool using zmq as well. Turns out it is on Jessie already [09:00:09] godog: backup4001 is not in manifests/site.pp, are these defined somewhere else? 
[09:00:09] there might be some side effect on the event logging system which uses some python zmq module as well :-( [09:00:36] hashar: the jessie releases are up-to-date, this is only for precise/trusty [09:00:43] yup [09:01:08] the cluster wide module versions are kind of a mess :-/ [09:04:18] moritzm: mhh curious, if icinga knows about it then puppet has to know about it as well, perhaps it got just the defaults [09:04:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [09:09:43] (03CR) 10Mobrovac: [C: 04-1] "Only setting the method is not going to be enough for POST and friends, unfortunately, as body fields need to be encoded. As per https://u" [puppet] - 10https://gerrit.wikimedia.org/r/226497 (owner: 10Giuseppe Lavagetto) [09:09:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [09:14:47] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [09:19:49] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [09:19:57] 6operations, 10ops-eqiad: ms-be1005.eqiad.wmnet: slot=9 dev=sdj failed - https://phabricator.wikimedia.org/T106654#1474304 (10fgiunchedi) 3NEW [09:20:58] RECOVERY - Disk space on ms-be1005 is 
OK: DISK OK [09:24:30] I need a new career path [09:24:39] IT is not for me really : Error: Could not retrieve catalog from remote server: Error 400 on SERVER: stack level too deep [09:24:42] (on labs dont worry) [09:24:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [09:29:49] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [09:34:49] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [09:39:47] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [09:42:26] unless there's someone able to look at backup4001 I'm going to silence it, I can't look at it ATM [09:42:43] (03PS1) 10Muehlenhoff: Add ferm rules for puppet master backends [puppet] - 10https://gerrit.wikimedia.org/r/226501 [09:43:14] (03PS1) 10Filippo Giunchedi: admin: researchers group access for srijan [puppet] - 10https://gerrit.wikimedia.org/r/226502 (https://phabricator.wikimedia.org/T106407) [09:44:49] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 
301469 MB (10% inode=99%) [09:45:11] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003 for Srijankedia - https://phabricator.wikimedia.org/T106407#1474423 (10fgiunchedi) >>! In T106407#1472569, @leila wrote: > @fgiunchedi > >>>! In T106407#1470142, @fgiunchedi wrote: >> * @srijan are you employed by WMF or vo... [09:47:20] 6operations, 10Beta-Cluster: puppet fail on deployment-mx - https://phabricator.wikimedia.org/T106660#1474435 (10hashar) 3NEW [09:49:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [09:54:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [09:59:47] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [10:04:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [10:09:48] PROBLEM - check_disk on backup4001 is CRITICAL: DISK CRITICAL - free space: / 874227 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 301469 MB (10% inode=99%) [10:10:54] (03PS2) 1020after4: Check for l10n cache before sync-wikiversions [tools/scap] - 
10https://gerrit.wikimedia.org/r/226353 (https://phabricator.wikimedia.org/T100573) [10:11:20] (03CR) 10jenkins-bot: [V: 04-1] Check for l10n cache before sync-wikiversions [tools/scap] - 10https://gerrit.wikimedia.org/r/226353 (https://phabricator.wikimedia.org/T100573) (owner: 1020after4) [10:13:30] (03PS1) 10Muehlenhoff: Add ferm rules for jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/226506 (https://phabricator.wikimedia.org/T104972) [10:20:28] (03PS3) 1020after4: Check for l10n cache before sync-wikiversions [tools/scap] - 10https://gerrit.wikimedia.org/r/226353 (https://phabricator.wikimedia.org/T100573) [10:21:26] (03CR) 1020after4: "Revised to significantly improve the error messages." [tools/scap] - 10https://gerrit.wikimedia.org/r/226353 (https://phabricator.wikimedia.org/T100573) (owner: 1020after4) [10:26:00] (03PS1) 10Muehlenhoff: Add ferm rules for HHVM admin site [puppet] - 10https://gerrit.wikimedia.org/r/226507 [10:30:19] 6operations, 7Database: mariadb multi-source replication glitch with site_identifiers - https://phabricator.wikimedia.org/T106647#1474487 (10jcrespo) {P1047} @springle Did you do any out-of-band changes to the table after the second replication error? (I do not care if you did, *I did out of band changes to c... 
[10:34:09] (03CR) 10Matanya: "dup of https://gerrit.wikimedia.org/r/#/c/226492/" [puppet] - 10https://gerrit.wikimedia.org/r/226502 (https://phabricator.wikimedia.org/T106407) (owner: 10Filippo Giunchedi) [10:34:15] (03PS2) 10Giuseppe Lavagetto: service::checker: add support for other HTTP verbs [puppet] - 10https://gerrit.wikimedia.org/r/226497 [10:37:40] (03Abandoned) 10Filippo Giunchedi: admin: researchers group access for srijan [puppet] - 10https://gerrit.wikimedia.org/r/226502 (https://phabricator.wikimedia.org/T106407) (owner: 10Filippo Giunchedi) [10:37:45] (03PS2) 10Filippo Giunchedi: access: grant srijan access to stat1003 via research group [puppet] - 10https://gerrit.wikimedia.org/r/226492 (owner: 10Matanya) [10:38:32] (03CR) 10jenkins-bot: [V: 04-1] access: grant srijan access to stat1003 via research group [puppet] - 10https://gerrit.wikimedia.org/r/226492 (owner: 10Matanya) [10:38:34] matanya: ^ missed it because of wrong bug: header, updated in https://gerrit.wikimedia.org/r/#/c/226492/ [10:44:22] 6operations, 10Wikimedia-DNS, 7Mail: Set up role accounts and feedback loops (FBL) with all providers - https://phabricator.wikimedia.org/T106664#1474530 (10Nemo_bis) 3NEW [10:48:53] jynus, around? the kartotherian service is not starting on maps cluster, possibly due to perms. Could you help debug? [10:48:53] 6operations, 7Database: mariadb multi-source replication glitch with site_identifiers - https://phabricator.wikimedia.org/T106647#1474545 (10jcrespo) [10:48:55] 6operations, 7Database: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1474544 (10jcrespo) [10:49:30] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1474548 (10fgiunchedi) update on this, I was talking to @MoritzMuehlenhoff and since ffmpeg will b... [10:50:26] yurik, will do in a second [10:50:30] thx! 
[10:50:31] 6operations, 7Database: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1437197 (10jcrespo) Adding replication error as a blocking task, let's not rush and discard any replication issues with 10 masters first. [10:51:18] PROBLEM - puppet last run on mw2127 is CRITICAL puppet fail [10:52:12] yurik, what do you mean with "does not start"? is there a ticket? [10:52:44] we updated the start method yesterday [10:53:17] jynus, i updated the code yesterday, and not sure if that was the cause - i don't have sudo, and can't debug it [10:53:27] i would need to walk you through it [10:53:45] ok, lets talk in private to not flood this channel [10:56:57] !log disabling puppet on maps-test hosts to debug service issue [10:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:57:35] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: New production ssh key for awight - https://phabricator.wikimedia.org/T106625#1474561 (10fgiunchedi) the "lytho" key has been replaced in {T105563} about 10 days ago, what happened to the key we are replacing now? 
[10:57:44] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: New production ssh key for awight - https://phabricator.wikimedia.org/T106625#1474563 (10fgiunchedi) p:5Triage>3Normal [11:00:03] 6operations: create script for consistency checks / advice which machine to pick as a new mysql master if master is down - https://phabricator.wikimedia.org/T79713#1474565 (10fgiunchedi) 5Open>3Resolved [11:00:51] 6operations: lvs servers report 'Memory allocation problem' on bootup - https://phabricator.wikimedia.org/T82849#1474571 (10fgiunchedi) 5Open>3Resolved [11:03:16] 6operations: Make Puppet run NICEd on all servers - https://phabricator.wikimedia.org/T78848#1474578 (10fgiunchedi) [11:03:27] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1474583 (10mark) Hi Yuri, Can we first try to make partial access sufficient? That's generally a far better/more secure way of dealing wit... 
[11:04:01] 6operations: Make Puppet run NICEd on all servers - https://phabricator.wikimedia.org/T78848#1474585 (10fgiunchedi) we've puppet 3 rolled out now, so we could run puppet agent nice'd in theory without affecting services spawned by puppet itself [11:17:38] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [11:33:35] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: kartotherian service does not start on maps-test cluster - https://phabricator.wikimedia.org/T106667#1474600 (10Yurik) 3NEW a:3jcrespo [12:05:03] 6operations, 10OCG-General-or-Unknown: Ferm rules for ocg hosts - https://phabricator.wikimedia.org/T104976#1474663 (10MoritzMuehlenhoff) 5Open>3Resolved a:3MoritzMuehlenhoff The ocg* hosts are already covered (it was initially overlooked, since base::firewall is included in the role definition) [12:15:01] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1474695 (10Yurik) Today during debugging, we used the following sudo rights: * suspend puppet * edit /etc/kartotherian/config.yaml to enab... [12:15:53] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1474698 (10Yurik) [12:53:41] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: kartotherian service does not start on maps-test cluster - https://phabricator.wikimedia.org/T106667#1474819 (10jcrespo) p:5Unbreak!>3Normal This is not "unbreak now", this is not production: "maps-test2001". 
[13:02:51] (03PS3) 10Giuseppe Lavagetto: service::checker: add support for other HTTP verbs [puppet] - 10https://gerrit.wikimedia.org/r/226497 [13:08:25] !log graphoid deploying 81b9633 [13:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:09:47] (03CR) 10Eevans: Cassanra logstash setup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) (owner: 10Eevans) [13:11:21] 6operations, 6Services, 5Patch-For-Review, 7Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1474855 (10mobrovac) [13:12:41] (03PS14) 10Eevans: Cassanra logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) [13:17:43] (03PS4) 10Giuseppe Lavagetto: service::checker: add support for other HTTP verbs [puppet] - 10https://gerrit.wikimedia.org/r/226497 [13:19:48] RECOVERY - check_disk on backup4001 is OK: DISK OK - free space: / 874226 MB (99% inode=99%): /dev 7991 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 8000 MB (100% inode=99%): /archive 799674 MB (27% inode=99%) [13:21:54] jynus: hi [13:22:13] (03CR) 10Mobrovac: [C: 031] "Graphoid, Mathoid and Citoid all report:" [puppet] - 10https://gerrit.wikimedia.org/r/226497 (owner: 10Giuseppe Lavagetto) [13:22:23] https://phabricator.wikimedia.org/T106682 [13:25:23] (03CR) 10Giuseppe Lavagetto: [V: 032] service::checker: add support for other HTTP verbs [puppet] - 10https://gerrit.wikimedia.org/r/226497 (owner: 10Giuseppe Lavagetto) [13:25:32] (03CR) 10Giuseppe Lavagetto: [C: 032] service::checker: add support for other HTTP verbs [puppet] - 10https://gerrit.wikimedia.org/r/226497 (owner: 10Giuseppe Lavagetto) [13:29:20] (03PS1) 10Giuseppe Lavagetto: graphoid: enable spec-based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/226527 (https://phabricator.wikimedia.org/T94821) [13:30:08] (03CR) 
10jenkins-bot: [V: 04-1] graphoid: enable spec-based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/226527 (https://phabricator.wikimedia.org/T94821) (owner: 10Giuseppe Lavagetto) [13:32:20] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: kartotherian service does not start on maps-test cluster - https://phabricator.wikimedia.org/T106667#1474907 (10Yurik) Related P1048 [13:45:09] (03PS2) 10Giuseppe Lavagetto: graphoid: enable spec-based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/226527 (https://phabricator.wikimedia.org/T94821) [13:47:47] PROBLEM - High load average on ms-be1003 is CRITICAL - load average: 334.22, 222.37, 106.20 [13:50:48] (03CR) 10Giuseppe Lavagetto: [C: 032] graphoid: enable spec-based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/226527 (https://phabricator.wikimedia.org/T94821) (owner: 10Giuseppe Lavagetto) [13:51:48] (03PS1) 10Ottomata: Provision analytics1042-1045 as analytics worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/226530 [13:52:09] (03PS2) 10Ottomata: Provision analytics1042-1045 as analytics worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/226530 [13:56:08] (03CR) 10Ottomata: [C: 032] Provision analytics1042-1045 as analytics worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/226530 (owner: 10Ottomata) [13:58:01] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1474953 (10mark) Filippo (on Clinic Duty) and Jaime/Moritz are seeing if they can fix this today. [14:01:39] 6operations: Migrate access-requests@ from RT to Phabricator - https://phabricator.wikimedia.org/T84861#1474966 (10Aklapper) a:3mark >! In T84861#1365698, @chasemp wrote: >>! In T84861#1321088, @Dzahn wrote: > Since new requests have been migrated to phab, do we still plan to import the existing old tickets be... 
[14:08:07] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [14:09:27] 6operations: Migrate access-requests@ from RT to Phabricator - https://phabricator.wikimedia.org/T84861#1475012 (10mark) a:5mark>3Aklapper I'd be inclined to say, not really necessary, as long as we keep a DB backup of RT just in case. [14:09:43] <_joe_> uhm what's up with icinga, checking [14:10:47] 6operations, 7Documentation: Update wiki documentation related to RT - https://phabricator.wikimedia.org/T76990#1475014 (10Aklapper) #operations: Anything specifically left to do here or can this be closed? [14:11:13] <_joe_> ottomata: Error: Could not find any host matching 'analytics1045' [14:11:18] <_joe_> in neon's logs [14:12:07] puppet probably just needs to run on neon [14:12:07] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [14:13:16] running puppet [14:15:40] 6operations, 6Labs, 6Multimedia, 10wikitech.wikimedia.org, and 2 others: Some wikitech.wikimedia.org thumbnails broken (404) - https://phabricator.wikimedia.org/T93041#1475035 (10Aklapper) 5Open>3Resolved With the given links in this task and its duplicates I cannot find any broken thumbnails anymore.... [14:19:05] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [14:19:52] (03CR) 10Mobrovac: [C: 031] Cassanra logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) (owner: 10Eevans) [14:25:23] Hallo [14:25:32] Is this the right channel for SWAT? [14:31:55] (03PS1) 10Muehlenhoff: Don't use firejail for systemd-based services yet. kartotherian needs to be fixed to work with firejail first. 
[puppet] - 10https://gerrit.wikimedia.org/r/226534 [14:31:57] yes aharoni [14:32:15] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [14:32:18] jouncebot: next [14:32:18] In 0 hour(s) and 27 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150723T1500) [14:32:25] <_joe_> moritzm: just kartotherian or all systemd services? [14:32:25] aharoni: ^ [14:32:28] RECOVERY - Host mw2027 is UP: PING WARNING - Packet loss = 28%, RTA = 43.43 ms [14:32:40] (03CR) 10Filippo Giunchedi: [C: 031] Don't use firejail for systemd-based services yet. kartotherian needs to be fixed to work with firejail first. [puppet] - 10https://gerrit.wikimedia.org/r/226534 (owner: 10Muehlenhoff) [14:33:04] _joe_: all node+systemd afaict [14:33:12] <_joe_> ok [14:33:17] well "our" node services [14:33:25] matanya: gracias [14:33:31] so much fun today [14:33:34] config changes [14:33:35] <_joe_> godog: uhm ok [14:33:49] a big new frequently-request ContentTranslation feature [14:33:53] all the others (mathoid/graphoid/citoid) work fine (and use upstart, but that's unrelated) [14:33:56] frequently-requested [14:34:06] godog: sorry and thanks for pointing out i had the wrong bug # [14:34:11] <_joe_> moritzm: so your patch is wrong [14:34:39] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I am about to release a new service on a jessie host, "mobileapps", and I'd like it to use firejail from day one." 
[puppet] - 10https://gerrit.wikimedia.org/r/226534 (owner: 10Muehlenhoff) [14:35:09] <_joe_> well, not wrong, but if the problem is not systemd, let's just revert the patch that removed the conditional [14:39:47] need to debug some more [14:42:09] I will work on maps-test2002 [14:58:18] * aharoni just realized that with a bit of imagination "chatzilla" is very similar for the Hebrew word for "eggplant" [15:00:04] manybubbles anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. 
Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150723T1500). [15:00:04] aude: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:24] can swat [15:00:34] aude: aharoni around for SWAT? [15:00:38] hi thcipriani [15:00:39] yes [15:00:45] okie doke [15:00:46] two rather simple config changes [15:01:24] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225322 (owner: 10Amire80) [15:01:47] aharoni: When I use my alternate IRC nick "molliug", d.omas makes fun of me because it's very close to "moliūgas", Lithuanian for "pumpkin". [15:01:54] (03Merged) 10jenkins-bot: Add wgSitename and wgMetaNamespace for pnbwikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225322 (owner: 10Amire80) [15:02:43] Merged! So the of https://pnb.wikipedia.org is supposed to change. After scap? [15:03:13] <_joe_> aharoni: surely not before of that :) [15:04:12] <Mjbmr> can you please let me add some more patches before scap? [15:04:43] <logmsgbot> !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: Add wgSitename and wgMetaNamespace for pnbwikipedia [[gerrit:225322]] (duration: 00m 12s) [15:04:50] <thcipriani> ^ aharoni check please [15:04:50] <hashar> is that SWAT already ? [15:04:51] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:10] <thcipriani> hashar: SWAT indeed. [15:05:23] <hashar> i felt short upgrading Zuul :-( [15:05:34] * hashar shakes fist at .deb packaging toolchain [15:05:46] <Mjbmr> thcipriani: I'm added some backports, please check. 
[15:05:56] <thcipriani> Mjbmr: kk [15:06:03] <Mjbmr> 'v [15:06:03] <Mjbmr> Thanks [15:06:11] <aharoni> thcipriani: I still don't see a change [15:06:34] <thcipriani> Anyone know about this error: Undefined variable: wmgVisualEditorAutoAccountEnable in /srv/mediawiki/wmf-config/CommonSettings.php on line 2037 [15:07:08] <aharoni> hehe, sounds related to VE's A/B testing [15:08:00] <Mjbmr> thcipriani: https://gerrit.wikimedia.org/r/226453 [15:09:06] <icinga-wm> RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [15:09:14] <thcipriani> hmm, yeah, that's probably it, it's definitely looming large in the error count [15:09:27] <thcipriani> aharoni: looks like that code should be synced out [15:09:39] <thcipriani> does that require a full scap to update? [15:09:59] <aharoni> I don't know [15:10:42] <aharoni> My understanding of ops is very low. [15:10:50] <matanya> aharoni: i'd wait 10 minutes [15:10:55] <aharoni> ok [15:10:58] <matanya> let the entire cluster get it [15:12:25] <grrrit-wm> (03PS2) 10Muehlenhoff: Don't use firejail for systemd-based services yet. systemd refuses to start setuid binaries (such as firejail) if User= or Group= are specified in the unit file. That can probably be configured, but let's sort that out later. 
[puppet] - 10https://gerrit.wikimedia.org/r/226534 [15:12:29] <grrrit-wm> (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224031 (https://phabricator.wikimedia.org/T105327) (owner: 10Amire80) [15:12:40] <wikibugs> 6operations: suspected opendj file descriptor leak on neptunium - https://phabricator.wikimedia.org/T84082#1475229 (10fgiunchedi) [15:12:56] <grrrit-wm> (03Merged) 10jenkins-bot: Set a different wmgContentTranslationDefaultSourceLanguage for English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224031 (https://phabricator.wikimedia.org/T105327) (owner: 10Amire80) [15:13:34] <paravoid> moritzm: that commit msg looks funny :) [15:13:43] <wikibugs> 6operations: suspected opendj file descriptor leak on neptunium - https://phabricator.wikimedia.org/T84082#1475236 (10fgiunchedi) p:5Normal>3Low might be still the case, needs to be checked again ``` neptunium:~$ sudo lsof -p 28336 | tail -20 java 28336 opendj 5213u IPv6 928997600 0t0... 
[15:14:47] <logmsgbot> !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: Set a different wmgContentTranslationDefaultSourceLanguage for English part I [[gerrit:224031]] (duration: 00m 13s) [15:14:51] <Mjbmr> aharoni: pnbwiki not pnbwikipedia [15:14:53] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:18] <logmsgbot> !log thcipriani Synchronized wmf-config/CommonSettings.php: SWAT: Set a different wmgContentTranslationDefaultSourceLanguage for English part II [[gerrit:224031]] (duration: 00m 12s) [15:15:24] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:26] <jynus> wow, moritzm, I do not know how you find that, but thanks [15:15:26] <thcipriani> ^ aharoni check please [15:15:45] <jynus> I can confirm it works with no user/group [15:16:07] <thcipriani> aude: ping for SWAT if you're around [15:16:11] <aude> here [15:16:18] <aharoni> ow [15:16:25] <aharoni> can I fix that quickly now? [15:16:35] <aharoni> should be pnbwiki of course [15:16:44] <aharoni> checking the other patch in the meantime [15:17:06] <icinga-wm> PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [15:17:20] <thcipriani> aharoni: yes please post an updated patch [15:17:56] <aharoni> ok [15:19:06] <jynus> but it runs service as root [15:19:44] <thcipriani> Mjbmr: I'm going to do you patches as a group in one scap [15:19:49] <thcipriani> *your [15:19:59] <Mjbmr> ok, thanks. 
[15:20:02] <aude> thcipriani: ok [15:20:07] <grrrit-wm> (03PS1) 10Amire80: Fix pnbwikipedia to pnbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226543 [15:20:26] <grrrit-wm> (03CR) 10Nikerabbit: [C: 031] Fix pnbwikipedia to pnbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226543 (owner: 10Amire80) [15:20:33] <grrrit-wm> (03PS2) 10coren: Labs: Script to back labstore filesystems up [puppet] - 10https://gerrit.wikimedia.org/r/224064 (https://phabricator.wikimedia.org/T105027) [15:20:39] <aude> oh more patches [15:20:59] <moritzm> jynus: yeah, let's start it without firejail for now and sort how to tweak the unit file later, there's probably an option [15:22:00] <jynus> how hacky would be to let systemctl run a script, that runs suid, that runs node? [15:22:38] <aharoni> thcipriani: hi, are you deploying the pnbwiki fix? [15:23:13] <thcipriani> aharoni: do you have a patch? [15:23:16] <icinga-wm> RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [15:23:23] <aharoni> yes, https://gerrit.wikimedia.org/r/226543 [15:23:25] <thcipriani> I can likely get it out before scap [15:23:26] <icinga-wm> PROBLEM - swift-object-server on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:23:36] <icinga-wm> PROBLEM - RAID on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:23:36] <icinga-wm> PROBLEM - swift-container-replicator on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:23:46] <icinga-wm> PROBLEM - swift-account-replicator on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:23:46] <icinga-wm> PROBLEM - SSH on ms-be1003 is CRITICAL - Socket timeout after 10 seconds [15:23:55] <grrrit-wm> (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226543 (owner: 10Amire80) [15:23:56] <icinga-wm> PROBLEM - swift-account-auditor on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:23:56] <icinga-wm> PROBLEM - dhclient process on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:24:01] <grrrit-wm> (03Merged) 10jenkins-bot: Fix pnbwikipedia to pnbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226543 (owner: 10Amire80) [15:24:06] <icinga-wm> PROBLEM - puppet last run on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:24:06] <icinga-wm> PROBLEM - swift-account-server on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:24:09] <aharoni> thcipriani: is https://gerrit.wikimedia.org/r/#/c/224031/2 supposed to be active already? [15:24:25] <icinga-wm> PROBLEM - salt-minion processes on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:24:25] <icinga-wm> PROBLEM - swift-object-updater on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:24:28] <thcipriani> aharoni: yup should be sync'd [15:24:35] <icinga-wm> PROBLEM - swift-account-reaper on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:24:36] <icinga-wm> PROBLEM - DPKG on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:24:47] <icinga-wm> PROBLEM - swift-container-updater on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:24:56] <icinga-wm> PROBLEM - swift-container-auditor on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:25:06] <icinga-wm> PROBLEM - swift-object-auditor on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:25:16] <icinga-wm> PROBLEM - swift-container-server on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:25:17] <icinga-wm> PROBLEM - swift-object-replicator on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:25:42] <aharoni> thcipriani: OK, the second one works as expected. [15:25:48] <wikibugs> 6operations, 6Labs, 6Multimedia, 10wikitech.wikimedia.org, and 2 others: Some wikitech.wikimedia.org thumbnails broken (404) - https://phabricator.wikimedia.org/T93041#1475285 (10Krenair) 5Resolved>3Open based on Special:ListFiles a bunch of SVGs and PDFs are still broken [15:26:24] <logmsgbot> !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: Add wgSitename and wgMetaNamespace for pnbwiki [[gerrit:226543]] (duration: 00m 12s) [15:26:31] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:43] <thcipriani> ^ aharoni should work now, sorry should have caught that. [15:27:03] <aharoni> checking [15:27:13] <aharoni> thcipriani: works [15:27:18] <aharoni> that's all, thank you very much [15:27:20] <thcipriani> nice! Thanks! [15:27:27] <aharoni> Mjbmr: thanks for catching my silly bug! [15:27:58] <Mjbmr> aharoni: no worry, thanks for handling wikimedia-site-requests [15:28:02] <moritzm> jynus: that won't work. let's revert to the old behaviour, I'll have a look at the systemd source what's going on there [15:28:07] <Krenair> thcipriani, can we do https://gerrit.wikimedia.org/r/#/c/226469/ and https://gerrit.wikimedia.org/r/#/c/226470/ in this swat? [15:28:41] <jynus> _joe_, are you ok with that or sould we do a conditional? 
[15:29:02] <thcipriani> Krenair: sure, I can probably get them done before all the core stuff is done merging [15:29:35] <grrrit-wm> (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226469 (owner: 10Alex Monk) [15:29:43] <_joe_> jynus: one sec [15:30:01] <jynus> actually, you have to be ok with that, as yours woldn't work either [15:30:06] <grrrit-wm> (03Merged) 10jenkins-bot: Remove reference to nonexistent ru_sibwiki.png [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226469 (owner: 10Alex Monk) [15:30:36] <_joe_> jynus: I'm trying to save ms-be1003 atm [15:30:57] <_joe_> but no luck [15:31:03] <wikibugs> 6operations: Migrate parsercache away from being a full RDBMS - https://phabricator.wikimedia.org/T84187#1475301 (10fgiunchedi) [15:31:22] <logmsgbot> !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: Remove reference to nonexistent ru_sibwiki.png [[gerrit:226469]] (duration: 00m 14s) [15:31:24] <_joe_> !log rebooting ms-be1003, stuck in kernel locks [15:31:28] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:31:30] <thcipriani> ^ Krenair one down [15:31:34] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:31:45] <grrrit-wm> (03PS1) 10Faidon Liambotis: Revoke gage's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/226545 [15:32:20] <grrrit-wm> (03CR) 10Faidon Liambotis: [C: 032 V: 032] Revoke gage's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/226545 (owner: 10Faidon Liambotis) [15:32:36] <jynus> "higher throughput with less overhead and better connection characteristics" [15:33:46] <icinga-wm> RECOVERY - swift-object-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [15:33:56] <icinga-wm> RECOVERY - swift-object-server on ms-be1003 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server 
[15:34:01] <grrrit-wm> (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226470 (owner: 10Alex Monk) [15:34:06] <icinga-wm> RECOVERY - swift-container-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [15:34:07] <icinga-wm> RECOVERY - RAID on ms-be1003 is OK optimal, 14 logical, 14 physical [15:34:10] <grrrit-wm> (03Merged) 10jenkins-bot: Remove/fix references to non-existent wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226470 (owner: 10Alex Monk) [15:34:16] <icinga-wm> RECOVERY - swift-account-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [15:34:16] <icinga-wm> RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2wmfprecise2 (protocol 2.0) [15:34:20] <godog> jynus: old ticket I've made public, might be irrelevant now :) [15:34:26] <icinga-wm> RECOVERY - dhclient process on ms-be1003 is OK: PROCS OK: 0 processes with command name dhclient [15:34:26] <icinga-wm> RECOVERY - swift-account-auditor on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [15:34:30] <grrrit-wm> (03CR) 10Negative24: "@chasemp I'll see what I can salvage from this patch. In the case I abandon it, it can be used as a pattern for what needs to be modified." 
[puppet] - 10https://gerrit.wikimedia.org/r/222987 (https://phabricator.wikimedia.org/T104827) (owner: 10Negative24) [15:34:35] <icinga-wm> RECOVERY - puppet last run on ms-be1003 is OK Puppet is currently enabled, last run 41 minutes ago with 0 failures [15:34:36] <icinga-wm> RECOVERY - swift-account-server on ms-be1003 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [15:34:37] <icinga-wm> ACKNOWLEDGEMENT - puppet last run on ms-be1005 is CRITICAL Puppet has 1 failures Filippo Giunchedi T106654 [15:34:47] <icinga-wm> RECOVERY - swift-object-updater on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [15:34:47] <icinga-wm> ACKNOWLEDGEMENT - RAID on ms-be1005 is CRITICAL 1 failed LD(s) (Offline) Filippo Giunchedi T106654 [15:34:47] <icinga-wm> RECOVERY - salt-minion processes on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:34:57] <icinga-wm> RECOVERY - swift-account-reaper on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:35:06] <icinga-wm> RECOVERY - DPKG on ms-be1003 is OK: All packages OK [15:35:16] <icinga-wm> RECOVERY - swift-container-updater on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [15:35:26] <icinga-wm> RECOVERY - High load average on ms-be1003 is OK - load average: 13.73, 6.15, 2.31 [15:35:26] <icinga-wm> RECOVERY - swift-container-auditor on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:35:36] <icinga-wm> RECOVERY - swift-object-auditor on ms-be1003 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [15:35:37] <icinga-wm> RECOVERY - swift-container-server on ms-be1003 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [15:35:44] <logmsgbot> !log 
thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: fix references to non-existent wikis [[gerrit:226470]] (duration: 00m 13s) [15:35:50] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:53] <thcipriani> ^ Krenair and that's two [15:36:01] <Krenair> thanks [15:36:17] <grrrit-wm> (03PS10) 10Negative24: Phabricator: Create differential puppet role [puppet] - 10https://gerrit.wikimedia.org/r/222987 (https://phabricator.wikimedia.org/T104827) [15:36:27] <wikibugs> 6operations: A puppet run should not start if a box is under abnormal load. - https://phabricator.wikimedia.org/T84183#1475349 (10fgiunchedi) [15:36:33] <_joe_> jynus: sorry what was you asking about? [15:36:39] <_joe_> the systemd firejail disabling? [15:37:19] <jynus> firejail does not work (at least for now) with systemd [15:37:38] <grrrit-wm> (03PS11) 10Negative24: Phabricator: Create diffusion puppet role [puppet] - 10https://gerrit.wikimedia.org/r/222987 (https://phabricator.wikimedia.org/T104827) [15:37:38] <_joe_> ok [15:37:39] <jynus> not only that particular service [15:37:50] <grrrit-wm> (03PS2) 10Alex Monk: Set ilwikimedia, noboard_chapterswikimedia and arbcom_dewiki's logos to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226471 [15:37:58] <jynus> we do not consider this a final patch [15:38:12] <thcipriani> Mjbmr: dang looks like qunit tests failed on one [15:38:14] <grrrit-wm> (03CR) 10Giuseppe Lavagetto: [C: 031] "Since this seems like a systemd+firejail problem, I revert my objection." [puppet] - 10https://gerrit.wikimedia.org/r/226534 (owner: 10Muehlenhoff) [15:38:14] <jynus> but we think for now it is better to not use it [15:38:16] <grrrit-wm> (03PS1) 10Tim Landscheidt: Tools: Remove grid host alias for deleted instance [puppet] - 10https://gerrit.wikimedia.org/r/226546 (https://phabricator.wikimedia.org/T104919) [15:38:37] <grrrit-wm> (03CR) 10Negative24: "Phabricator's names are confusing. 
Diffusion: repo hosting, Differential: code review." [puppet] - 10https://gerrit.wikimedia.org/r/222987 (https://phabricator.wikimedia.org/T104827) (owner: 10Negative24) [15:38:37] <jynus> we can create a ticket to investigate how to make it work [15:38:44] <moritzm> !log added jenkins-debian-glue 0.13.0 to apt.wikimedia.org (jessie-wikimedia) [15:38:51] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:57] <moritzm> ^ hashar [15:39:14] <jynus> _joe_, because the workaround is to run it as root, which is a big no [15:39:36] <icinga-wm> PROBLEM - Host es2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:09] <jynus> ^ppauls wants to kill me with a heartattack today :-) [15:40:25] <_joe_> you read 1004, right? [15:40:27] <_joe_> I did [15:40:42] <_joe_> gasped, reread, and was about to ask you and papaul about it :P [15:40:48] <jynus> 1004 is that a ticket? [15:40:56] <_joe_> no es1004 [15:41:05] <jynus> ah [15:41:38] <jynus> well "pc", "es" or "db" are "my thing" [15:41:40] <thcipriani> Mjbmr: trying to make it re-try merging that patch [15:41:52] <papaul> jynus: was going on [15:41:56] <Mjbmr> thcipriani: see that, thanks [15:42:21] <jynus> nothing, papaul I know what you are doing, please continue doiing it, and thanks! [15:42:33] <papaul> jynus: ok [15:43:36] <jynus> es2004 is depooled, no problem :-) but all alerts scare me [15:44:41] <wikibugs> 6operations: Identify servers with h310 controllers - https://phabricator.wikimedia.org/T84356#1475384 (10fgiunchedi) [15:44:50] <aude> thcipriani: still swatting? [15:45:04] <thcipriani> aude: yeah, still trying [15:45:06] <aude> ok [15:45:09] <jynus> moritzm, let me merge [15:45:22] <wikibugs> 6operations, 7HHVM, 5Patch-For-Review: Custom session handler corrupted by session_destroy, "Failed to initialize storage module" - https://phabricator.wikimedia.org/T97675#1249847 (10hashar) >>! 
In T97675#1471267, @Anomie wrote: > Testing on beta, it appears that (once deployed) the new package will fix thi... [15:45:30] <jynus> ^226534, I mean [15:45:53] <thcipriani> just can't figure out how to make jenkins re-try to merge a patch... [15:46:07] <grrrit-wm> (03CR) 10Jcrespo: [C: 032] Don't use firejail for systemd-based services yet. systemd refuses to start setuid binaries (such as firejail) if User= or Group= are specif [puppet] - 10https://gerrit.wikimedia.org/r/226534 (owner: 10Muehlenhoff) [15:46:27] <grrrit-wm> (03PS3) 10Jcrespo: Don't use firejail for systemd-based services yet. systemd refuses to start setuid binaries (such as firejail) if User= or Group= are specified in the unit file. That can probably be configured, but let's sort that out later. [puppet] - 10https://gerrit.wikimedia.org/r/226534 (owner: 10Muehlenhoff) [15:46:31] <wikibugs> 6operations: turn lldp info into puppet facts, mention in MOTD - https://phabricator.wikimedia.org/T84518#1475402 (10fgiunchedi) [15:47:52] <aude> thcipriani: which patch? [15:48:12] <thcipriani> aude: https://gerrit.wikimedia.org/r/#/c/226541 think I'm making progress now [15:48:13] <aude> i think you have to remove your +2 and then readd it (after jenkins has given +2) [15:48:22] * aude ran into issues earlier and that worked [15:48:34] <aude> oh, jenkins disapproves :/ [15:48:56] <aude> timeout [15:49:03] <thcipriani> aude: yeah, I think it was a fluke if you look at the test run, it's retesting now, then I'll remove my +2 and re-add [15:49:28] <moritzm> moritzm: sure, go ahead. otherwise I can do it in five minutes [15:49:29] <grrrit-wm> (03PS4) 10Filippo Giunchedi: Don't use firejail for systemd-based services yet. [puppet] - 10https://gerrit.wikimedia.org/r/226534 (owner: 10Muehlenhoff) [15:49:34] <moritzm> jynus: sure, go ahead. otherwise I can do it in five minutes [15:51:30] <thcipriani> aude: your SWAT patches don't need a scap, right? 
[15:52:08] <wikibugs> 6operations, 7Monitoring: Icinga should report when it's unable to refresh the config - https://phabricator.wikimedia.org/T83721#1475413 (10fgiunchedi) [15:52:18] <wikibugs> 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure: Update jenkins-debian-glue packages on Jessie to v0.13.0 - https://phabricator.wikimedia.org/T102106#1475416 (10hashar) 5Open>3Resolved Package has been uploaded on apt.wikimedia.org: ``` apt-cache policy jenkins-debian-glue je... [15:53:26] <wikibugs> 6operations, 7Monitoring: Icinga should report when it's unable to refresh the config - https://phabricator.wikimedia.org/T83721#1475423 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi this is happening now, courtesy of `check_icinga_config` [15:54:15] <aude> thcipriani: don't need scap [15:54:28] <wikibugs> 6operations: track package updates available for apt.wikimedia.org - https://phabricator.wikimedia.org/T84235#1475428 (10fgiunchedi) [15:54:36] <wikibugs> 6operations: Extend Wikimedia APT repository with more pinning alternatives - https://phabricator.wikimedia.org/T78948#1475439 (10fgiunchedi) [15:55:12] <thcipriani> aude: I may just lump you into the full scap I have to do for Mjbmr if that's ok [15:55:54] <aude> works for me [15:56:36] <thcipriani> cool. [15:57:19] <thcipriani> kk, now just have to wait for jenkins to get done, then extended SWAT [16:01:55] <wikibugs> 6operations: Investigate firejail for service::node and systemd - https://phabricator.wikimedia.org/T106701#1475457 (10MoritzMuehlenhoff) 3NEW a:3MoritzMuehlenhoff [16:02:56] <icinga-wm> PROBLEM - puppet last run on sca1001 is CRITICAL puppet fail [16:02:57] <jynus> and sadly, the old .service doesn't work either [16:04:24] <Mjbmr> oh'k, jenkins-bot [16:06:10] <Mjbmr> thcipriani: running scap? 
[16:06:23] <thcipriani> Mjbmr: not quite yet, still git wrangling on tin [16:07:39] <grrrit-wm> (03CR) 10Merlijn van Deen: [C: 031] Tools: Remove grid host alias for deleted instance [puppet] - 10https://gerrit.wikimedia.org/r/226546 (https://phabricator.wikimedia.org/T104919) (owner: 10Tim Landscheidt) [16:10:22] <logmsgbot> !log thcipriani Started scap: SWAT: Add azb interwiki sorting, Add Southern Luri, and Fix name of S and W Balochi [16:10:29] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:31] <thcipriani> ^ Mjbmr and aude there we go [16:10:45] <aude> thcipriani: thanks [16:10:58] <Mjbmr> thcipriani: thanks [16:14:28] <urandom> !log restarting Cassandra on restbase1001 to (temporarily) enable GC logging [16:14:36] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:16:36] <logmsgbot> !log thcipriani Finished scap: SWAT: Add azb interwiki sorting, Add Southern Luri, and Fix name of S and W Balochi (duration: 06m 13s) [16:16:43] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:16:46] <thcipriani> ^ that was suspiciously quick [16:16:57] <icinga-wm> PROBLEM - HHVM rendering on mw1221 is CRITICAL - Socket timeout after 10 seconds [16:17:10] <Mjbmr> thcipriani: all works! thanks again! [16:17:19] <thcipriani> Mjbmr: yw [16:17:26] <thcipriani> aude: everything look ok to you? [16:17:46] <icinga-wm> PROBLEM - Apache HTTP on mw1221 is CRITICAL - Socket timeout after 10 seconds [16:17:56] <Glaisher> hm [16:18:04] <Glaisher> I don't think we did that for other new wikis [16:18:27] <Glaisher> That's probably why the sorting on sidebar interwiki links is wrong. [16:19:41] <aude> thcipriani: i'm sure it's good [16:19:50] * aude checks [16:20:29] <Mjbmr> aude: thanks for handling it. [16:23:55] <aude> looks good [16:24:00] <aude> e.g. 
https://gom.wikipedia.org/wiki/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%87%E0%A4%B2_%E0%A4%AA%E0%A4%BE%E0%A4%A8 [16:24:06] <icinga-wm> PROBLEM - HHVM queue size on mw1221 is CRITICAL 100.00% of data above the critical threshold [80.0] [16:24:24] <thcipriani> aude: awesome, thanks! [16:24:33] <thcipriani> and now SWAT is complete. [16:24:51] <aude> maybe we need to also add tyv and mai [16:24:59] <aude> not in this swat though [16:25:46] <icinga-wm> PROBLEM - puppet last run on mw1031 is CRITICAL Puppet last ran 13 days ago [16:26:00] <Glaisher> aude: are the interwiki links sorted by wikibase? [16:26:33] <Glaisher> It would probably be better if it was sorted by alphabetical order by default (without having to add each language to the list) [16:26:45] <icinga-wm> PROBLEM - HHVM busy threads on mw1221 is CRITICAL 100.00% of data above the critical threshold [115.2] [16:26:49] <aude> Glaisher: agree [16:27:04] <ori> !log restarted hhvm on mw1221 [16:27:10] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:27:18] <bd808> thcipriani: without an l10n update scap is wicked fast [16:27:53] <thcipriani> bd808: evidently. [16:27:55] <icinga-wm> RECOVERY - Apache HTTP on mw1221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.223 second response time [16:27:55] <icinga-wm> RECOVERY - puppet last run on mw1031 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:28:41] <Mjbmr> Glaisher: https://gerrit.wikimedia.org/r/221124 [16:28:43] <bd808> If tin had ssds and twice as many cores scap would be fast every time [16:29:06] <icinga-wm> RECOVERY - HHVM rendering on mw1221 is OK: HTTP OK: HTTP/1.1 200 OK - 66972 bytes in 1.122 second response time [16:29:14] <matanya> jgage: is the strongswan work all done ? 
[16:29:27] <icinga-wm> RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:29:44] <Glaisher> Mjbmr: oh nice [16:30:28] <jgage> matanya: i'm not aware of any further code changes needed, but deployment is on hold [16:30:35] <aude> Glaisher: then we can move the other alternative sortings to mediawiki config [16:30:43] <aude> and just have the default be alphabetical [16:30:52] <Glaisher> right [16:30:56] <matanya> jgage: need any help with anything there ? [16:31:39] <jgage> matanya: thank you! but no, not right now. probably the best person to follow up with would be bblack. [16:32:02] <matanya> thanks much [16:35:06] <icinga-wm> RECOVERY - HHVM busy threads on mw1221 is OK Less than 30.00% above the threshold [76.8] [16:35:38] <_joe_> ori: what was going on on mw1221? [16:36:18] <_joe_> it doesn't happen often nowadays to see a lockup in hhvm [16:36:36] <icinga-wm> RECOVERY - HHVM queue size on mw1221 is OK Less than 30.00% above the threshold [10.0] [16:40:46] <grrrit-wm> (03CR) 10Awight: [C: 04-1] "I'd like to have both these keys active, if possible." [puppet] - 10https://gerrit.wikimedia.org/r/226488 (owner: 10Matanya) [16:41:58] <matanya> awight: thanks ^ will se what i can do. On a side note- i want to second what paravoid said on the mailing list, please don't use GH. [16:42:51] <matanya> oh, sorry, wrong person to address. [16:43:15] <ori> _joe_: i didn't look :( [16:43:19] <awight> matanya: ah, yeah holler if "GH" is something I should decode :) [16:43:31] <matanya> github. [16:43:42] <awight> hehe thx [16:43:55] <matanya> could be Geohack as well [16:46:15] <jynus> do I get a trophy for the smallest piece of code that took me the most to debug? 
[16:47:24] <wikibugs> 6operations: Investigate firejail for service::node and systemd - https://phabricator.wikimedia.org/T106701#1475685 (10MoritzMuehlenhoff) 5Open>3Invalid Couldn't reproduce this with a different setuid binary and it turned out to be a problem in service::node. [16:48:30] <matanya> one more thing jgage , as the author of the admin module, is it possible to have more than one ssh key tied to a user ? [16:50:33] <jgage> matanya: sorry, i'm not sure what module you're referring to? [16:50:53] <matanya> modules/admin [16:51:02] <jgage> i did not write that :) [16:52:28] <matanya> oh, sorry, i am mixed up today [16:52:57] <jgage> matanya: it does look like more than one ssh key is allowed though, check out uid 2129 [16:53:39] <grrrit-wm> (03PS1) 10Jcrespo: Reenabling firejail on systemd-based services [puppet] - 10https://gerrit.wikimedia.org/r/226559 [16:54:09] <Krenair> yeah, a bunch of people have more than one key [16:54:42] <jynus> ^moritzm, if you are still here, give that a +23 [16:54:47] <_joe_> jynus: ach my fault [16:54:59] <_joe_> damn quotes and systemd [16:55:01] <jynus> no, _joe_ [16:55:16] <jynus> 3 people looked thorougly at the code [16:55:36] <_joe_> still, the code was mine :) [16:55:49] <_joe_> and it's not the first time that bites me [16:56:08] <jynus> well, cannot we ban a regular expression? [16:56:39] <grrrit-wm> (03CR) 10Muehlenhoff: [C: 031] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/226559 (owner: 10Jcrespo) [16:57:16] <grrrit-wm> (03CR) 10Jcrespo: [C: 032] Reenabling firejail on systemd-based services [puppet] - 10https://gerrit.wikimedia.org/r/226559 (owner: 10Jcrespo) [16:57:47] <jynus> _joe_, to give you more reasons why it wasn't your fault, I was the one that should have create that in the fisrt place [16:59:16] <icinga-wm> RECOVERY - Host es2004 is UPING OK - Packet loss = 0%, RTA = 44.03 ms [17:00:27] <jynus> papaul, Thank you for your work! 
Only good news in the last 5 minutes [17:00:45] <papaul> jynus: you welcome [17:01:47] <jynus> I do not fully understand the whole new workflow on the gerrit for puppet [17:13:58] <wikibugs> 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: kartotherian service does not start on maps-test cluster - https://phabricator.wikimedia.org/T106667#1475797 (10jcrespo) 5Open>3Resolved After a very long debugging session, and involving a total of 3 people from operations, I can close this as f... [17:19:44] <wikibugs> 6operations, 10ops-codfw, 7Database: Faulty memory on es2004 (purchase one module) - https://phabricator.wikimedia.org/T103843#1475819 (10Papaul) Replaced memory in slot A2 and B2 with new memory. The server is reading now 64 GB . But at boot i am getting the message "Unsupported memory configuration. DIMM... [17:19:47] <wikibugs> 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: kartotherian service does not start on maps-test cluster - https://phabricator.wikimedia.org/T106667#1475820 (10Yurik) Awesome news!!! @jcrespo rulez! [17:29:03] <wikibugs> 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1475865 (10jcrespo) @Yurik, as you will see on T106667, sadly the logs you mention would not have been enough to solve the problem (it was... [17:31:26] <icinga-wm> PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 22 failures [17:36:11] <grrrit-wm> (03PS3) 10coren: Labs: Script to back labstore filesystems up [puppet] - 10https://gerrit.wikimedia.org/r/224064 (https://phabricator.wikimedia.org/T105027) [17:36:24] <Coren> YuviPanda: ^^ w/ paramiko and the ssh fix [17:37:06] <grrrit-wm> (03CR) 10jenkins-bot: [V: 04-1] Labs: Script to back labstore filesystems up [puppet] - 10https://gerrit.wikimedia.org/r/224064 (https://phabricator.wikimedia.org/T105027) (owner: 10coren) [17:37:53] <Coren> D'oh. .pp? 
[17:38:27] <legoktm> !log running foreachwikiindblist /home/legoktm/largebutnotenwiki.dblist populateContentModel.php --ns=all --table=page [17:38:33] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:38:42] <grrrit-wm> (03PS4) 10coren: Labs: Script to back labstore filesystems up [puppet] - 10https://gerrit.wikimedia.org/r/224064 (https://phabricator.wikimedia.org/T105027) [17:40:11] <YuviPanda> Coren: paramiko stuff is sitll commented out [17:41:13] * Coren boggles. [17:41:24] <grrrit-wm> (03PS1) 10Negative24: phab: Add passwd entries for vcs user [puppet] - 10https://gerrit.wikimedia.org/r/226573 [17:41:32] <Coren> I think I broke my git repo with a bad rebase. Lemme try this again. [17:42:04] <grrrit-wm> (03CR) 10jenkins-bot: [V: 04-1] phab: Add passwd entries for vcs user [puppet] - 10https://gerrit.wikimedia.org/r/226573 (owner: 10Negative24) [17:42:19] <grrrit-wm> (03PS5) 10coren: Labs: Script to back labstore filesystems up [puppet] - 10https://gerrit.wikimedia.org/r/224064 (https://phabricator.wikimedia.org/T105027) [17:42:28] <James_F> YuviPanda: Did you mean OOjs or OOjs UI? I guess if the latter I'd need to do the former anyway. [17:42:54] <YuviPanda> James_F: indeed, so halfa.k is using OOjs UI and so it'd have to be both [17:43:11] <aharoni> hallo [17:43:16] <aharoni> train in ~20 minutes? [17:43:27] <James_F> YuviPanda: Aha, sure. OK. [17:43:42] <YuviPanda> James_F: edited title to clarify. [17:44:13] <YuviPanda> James_F: this might allow more tools to use it as well, I think (I can even pin it on the tools.wmflabs.org/cdnjs page if you want) [17:44:24] * James_F nods. [17:44:29] <James_F> Let's first put it in. [17:44:47] <greg-g> aharoni: yeah, need something? [17:45:35] <YuviPanda> James_F: cool. shouldn't be too hard, I think Krinkle already maintains some there [17:45:43] * James_F nods. 
[17:46:18] <grrrit-wm> (03PS2) 10Negative24: phab: Add passwd entries for vcs user [puppet] - 10https://gerrit.wikimedia.org/r/226573 [17:46:21] <aharoni> greg-g: nothing special, just excited about new ContentTranslation features that are going to be deployed. [17:46:52] <bd808> legoktm: "largebutnotenwiki.dblist" is awesome :) [17:47:04] <grrrit-wm> (03CR) 10Filippo Giunchedi: Cassanra logstash setup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) (owner: 10Eevans) [17:47:12] <legoktm> :P [17:47:14] <grrrit-wm> (03PS1) 10Krinkle: contint: Remove krinkle from contint alert groups [puppet] - 10https://gerrit.wikimedia.org/r/226576 [17:47:17] <grrrit-wm> (03CR) 10jenkins-bot: [V: 04-1] phab: Add passwd entries for vcs user [puppet] - 10https://gerrit.wikimedia.org/r/226573 (owner: 10Negative24) [17:49:03] <greg-g> aharoni: :) :) [17:51:58] <grrrit-wm> (03CR) 10Yuvipanda: [C: 032] contint: Remove krinkle from contint alert groups [puppet] - 10https://gerrit.wikimedia.org/r/226576 (owner: 10Krinkle) [17:52:47] <grrrit-wm> (03PS1) 10Jcrespo: Repooling es2004 after hardware maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226578 [17:53:20] <grrrit-wm> (03PS3) 10Negative24: phab: Add passwd entries for vcs user [puppet] - 10https://gerrit.wikimedia.org/r/226573 [17:53:41] <grrrit-wm> (03PS2) 10Jcrespo: Repooling es2004 after hardware maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226578 [17:53:59] <grrrit-wm> (03CR) 10Jcrespo: [C: 032] Repooling es2004 after hardware maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226578 (owner: 10Jcrespo) [17:56:12] <logmsgbot> !log jynus Synchronized wmf-config/db-eqiad.php: Repooling es2004 after hardware maintenance (duration: 00m 12s) [17:56:18] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:56:39] <logmsgbot> !log jynus Synchronized wmf-config/db-codfw.php: Repooling 
es2004 after hardware maintenance (duration: 00m 11s) [17:56:45] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:57:19] <Negative24> chasemp: https://gerrit.wikimedia.org/r/#/c/226573/ in addition to your patch [18:00:04] <jouncebot> twentyafterfour greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150723T1800). Please do the needful. [18:00:41] <grrrit-wm> (03CR) 10Yuvipanda: "Also needs systemd units." (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/224064 (https://phabricator.wikimedia.org/T105027) (owner: 10coren) [18:00:45] <YuviPanda> Coren: ^ [18:00:59] <YuviPanda> Coren: python style issues, and 'use long form of all commands' mostly [18:01:27] <aharoni> Respected human :) [18:01:50] <aharoni> I should socialize here more. [18:02:23] <Coren> I always said that jouncebot should go "Human minion, hear an obey" instead. :-) [18:03:16] <Coren> YuviPanda: Not sure I get what you mean by "protect against shell execution" vs the other? [18:03:39] <YuviPanda> which comment? [18:03:39] <twentyafterfour> any reason I shouldn't deploy wmf15 right now? [18:03:39] <aharoni> How much time does it take till the train's changes are actually seen on the live sites? More like two minutes or more like half an hour? 
[18:04:04] <aude> aharoni: 2 min [18:04:09] <twentyafterfour> aharoni: more like 2 minutes [18:04:19] <YuviPanda> Coren: oh, for paramiko it's just a shell string that's being concatenated together, while for local execution it is a list so properly sanitized by python [18:04:25] <aharoni> nicey [18:05:54] <Coren> YuviPanda: the presence of proper quoting should make the difference not significant except if there is a bug in shlex.quote() [18:06:15] <grrrit-wm> (03PS1) 1020after4: all wikis to 1.26wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226585 [18:06:16] <Mjbmr> aharoni: please rechek https://gerrit.wikimedia.org/r/213839 https://gerrit.wikimedia.org/r/225644 [18:06:33] <Coren> YuviPanda: But yeah; having that be made explicit = good idea [18:07:01] <YuviPanda> Coren: aaarggh, I missed that. fair enough. [18:07:22] <Coren> YuviPanda: That said, the reason the context has both variants is so volume_device works right locally and remotely without code duplication (and possible divergence) [18:07:42] <YuviPanda> Coren: yeah, I missed the quote, is fair enough. would still maybe put a doc on the context object? [18:07:46] <grrrit-wm> (03CR) 1020after4: [C: 032] all wikis to 1.26wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226585 (owner: 1020after4) [18:07:53] <Coren> YuviPanda: Yep. Agreed. [18:07:53] <grrrit-wm> (03Merged) 10jenkins-bot: all wikis to 1.26wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226585 (owner: 1020after4) [18:08:04] <YuviPanda> Coren: just making obviously secure things obviously secure and vice versa. 
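Editorial sketch of the quoting point Coren and YuviPanda settle above: a local command can be passed as an argument list and never touches a shell, while a command sent over SSH ends up as a single string, so each part needs `shlex.quote()`. The function names `local_argv` and `remote_command` are hypothetical and are not taken from the patch under review (https://gerrit.wikimedia.org/r/224064); this is only a minimal illustration of the technique, assuming Python 3.

```python
import shlex


def local_argv(program, *args):
    """Return an argument list for local execution.

    A list handed to subprocess.run() is never parsed by a shell,
    so no quoting is needed for the individual arguments.
    """
    return [program, *args]


def remote_command(program, *args):
    """Return one shell-safe string for remote execution.

    A remote shell will parse the string, so every part is passed
    through shlex.quote() before being joined with spaces.
    """
    return " ".join(shlex.quote(part) for part in (program, *args))


if __name__ == "__main__":
    # Hypothetical volume name containing shell metacharacters.
    tricky = "vol name; rm -rf /"
    print(local_argv("lvs", tricky))
    print(remote_command("lvs", tricky))
```

Keeping both variants behind one small helper layer, with a docstring saying which one is shell-parsed, is one way to make the "obviously secure" property visible to reviewers.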
[18:08:50] <grrrit-wm> (03PS15) 10Eevans: Cassandra logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) [18:10:11] <logmsgbot> !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: all wikis to 1.26wmf15 [18:10:17] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:13:22] <twentyafterfour> small spike of dberrors when I sync'd wmf15, but it only lasted a few seconds it seems [18:14:04] <grrrit-wm> (03CR) 10BryanDavis: "Looks pretty good by eyeball. Somehow I messed up my testing VM so that sync-wikiversions fails with and without this patch. :/" (032 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/226353 (https://phabricator.wikimedia.org/T100573) (owner: 1020after4) [18:18:46] <wikibugs> 6operations, 10ops-codfw, 7Database: Faulty memory on es2004 (purchase one module) - https://phabricator.wikimedia.org/T103843#1476067 (10jcrespo) 5stalled>3Resolved I had to repool the server to mediawiki to close everithing. MySQL sees all memory with no problem, and it is not a critical server, so w... [18:23:46] <wikibugs> 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1476090 (10Krinkle) >>! In T99096#1474061, @mmodell wrote: >Can someone enlighten me about how the st... 
[18:25:32] <grrrit-wm> (03PS2) 10Tim Landscheidt: Ignore warnings about URLs without modules for private repository [puppet] - 10https://gerrit.wikimedia.org/r/198116 (https://phabricator.wikimedia.org/T87132) [18:26:51] <wikibugs> 6operations, 10Beta-Cluster, 10MediaWiki-extensions-GettingStarted: GettingStarted on Beta Cluster periodically loses its Redis index - https://phabricator.wikimedia.org/T100515#1476096 (10Mattflaschen) 5Open>3Resolved a:3Mattflaschen It hasn't recently, but I don't know that it's fixed either. I'll m... [18:26:58] <wikibugs> 6operations, 10Beta-Cluster, 10MediaWiki-extensions-GettingStarted: GettingStarted on Beta Cluster periodically loses its Redis index - https://phabricator.wikimedia.org/T100515#1476101 (10Mattflaschen) a:5Mattflaschen>3None [18:28:14] <icinga-wm> RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [18:28:52] <Krinkle> twentyafterfour: If you're up with resources and dedication for https://phabricator.wikimedia.org/T99096 I'd love to help out where I can (though I'm currently working on RL stuff elsewhere) [18:29:39] <RoanKattouw> twentyafterfour: You've finished the deploy train but still have the window, correct? [18:29:51] <RoanKattouw> twentyafterfour: If so, can I deploy a quick patch that just missed wmf15? [18:31:59] <twentyafterfour> RoanKattouw: correct [18:33:17] <wikibugs> 6operations, 10ops-eqiad, 7Database, 5Patch-For-Review: Remove db1002-db1007 from production - https://phabricator.wikimedia.org/T105768#1476125 (10Cmjohnson) We can decommission, wipe the disks and leave them to be re-purposed for labs or whatever else we need. Once removed and wiped I will add to server... 
[18:34:03] <icinga-wm> PROBLEM - check_mysql on payments2003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [18:34:58] <twentyafterfour> Krinkle: I am interested in seeing it happen, still trying to fully understand everything that would be involved. If we already call a function to dynamically generate the static URLs then we could conceivably change the url layout without a lot of huge changes [18:35:29] <Krinkle> twentyafterfour: We call a function in some cases, but not all. [18:35:37] <Krinkle> But we can audit it and make it so. [18:38:33] <Krinkle> I can give some guidance there to find them. It's mostly in two or three categories. If we miss any, we can find it by tailing varnish log for requests without our current query parameter after the audit [18:38:48] <Krinkle> twentyafterfour: I'd say, let's first audit them all to use that function, then we can switch it. [18:39:03] <icinga-wm> PROBLEM - check_mysql on payments2003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [18:39:10] <twentyafterfour> Krinkle: I think we would need to build an index of static assets, with original path mapped to the current hash value, and then rename all the files so that the hash is the filename [18:39:41] <twentyafterfour> (the rename and index generation would be part of the deploy) [18:40:03] <icinga-wm> PROBLEM - check_puppetrun on saiph is CRITICAL Puppet has 16 failures [18:40:13] <icinga-wm> PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [18:40:21] <Krinkle> twentyafterfour: Yeah, we can auto build the index though. Tis' basically **/* inside skins/resources/extensions [18:40:38] <Krinkle> those are all publicly exposed and addressable [18:40:50] <Jeff_Green> blarg. 
icinga alerts are me, fixing [18:43:47] <grrrit-wm> (03CR) 10coren: Labs: Script to back labstore filesystems up (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/224064 (https://phabricator.wikimedia.org/T105027) (owner: 10coren) [18:45:03] <icinga-wm> PROBLEM - check_puppetrun on saiph is CRITICAL Puppet has 16 failures [18:45:13] <icinga-wm> RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 203 seconds ago with 0 failures [18:45:13] <icinga-wm> PROBLEM - check_mysql on payments2002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [18:45:43] <RoanKattouw> thcipriani, aude: So it looks like https://gerrit.wikimedia.org/r/#/c/226537/ didn't come through in the s [18:45:45] <RoanKattouw> in the SWAT [18:45:52] <RoanKattouw> I'll deploy it now [18:46:11] <wikibugs> 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1476167 (10mmodell) >>! In T99096#1476090, @Krinkle wrote: > One thing to keep in mind as well is tha... 
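
The index twentyafterfour and Krinkle describe above (every file under skins/resources/extensions mapped from its original path to a hash of its current content) could be generated roughly like this. It is a sketch under assumptions: SHA-1 as the hash, keeping the file extension on the renamed file, and the function names are all choices made here for illustration, not anything decided in the discussion or in T99096.

```python
import hashlib
from pathlib import Path

# Directories named in the discussion as publicly exposed and addressable.
ASSET_DIRS = ("skins", "resources", "extensions")


def build_asset_index(root):
    """Map each asset's path (relative to root) to a hex digest of its content."""
    root = Path(root)
    index = {}
    for top in ASSET_DIRS:
        base = root / top
        if not base.is_dir():
            continue
        for path in sorted(base.rglob("*")):
            if path.is_file():
                # Assumption: SHA-1 of the file bytes; any stable content hash would do.
                digest = hashlib.sha1(path.read_bytes()).hexdigest()
                index[path.relative_to(root).as_posix()] = digest
    return index


def hashed_name(rel_path, digest):
    """File name to use once the hash becomes the file name (extension kept)."""
    return digest + Path(rel_path).suffix
```

Running `build_asset_index()` at deploy time and renaming each file to `hashed_name(path, digest)` would match the "rename and index generation would be part of the deploy" idea; how URLs are then generated from the index is left open here, as it is in the conversation.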
[18:46:40] <logmsgbot> !log catrope Synchronized php-1.26wmf15/resources/src/mediawiki.less/mediawiki.ui/mixins.less: Unbreak quiet button styles (duration: 00m 13s) [18:46:47] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:48:04] <RoanKattouw> thcipriani, aude: Strike that, it did make it in, because Gerrit auto-generated the same submodule update and that did get deployed [18:48:57] <thcipriani> RoanKattouw: had me worried about what I sync'd there for a second :) [18:49:28] <wikibugs> 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1476179 (10yuvipanda) 3NEW [18:50:03] <icinga-wm> PROBLEM - check_puppetrun on saiph is CRITICAL Puppet has 16 failures [18:50:13] <icinga-wm> RECOVERY - check_mysql on payments2002 is OK: Uptime: 453 Threads: 1 Questions: 1532 Slow queries: 0 Opens: 126 Flush tables: 1 Open tables: 29 Queries per second avg: 3.381 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [18:50:13] <icinga-wm> RECOVERY - check_mysql on payments2003 is OK: Uptime: 169 Threads: 1 Questions: 1241 Slow queries: 11 Opens: 231 Flush tables: 1 Open tables: 64 Queries per second avg: 7.343 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [18:50:29] <grrrit-wm> (03CR) 10Tim Landscheidt: [C: 04-1] "A number of occurences have been added since PS1, so I need to add ignores for them." 
[puppet] - 10https://gerrit.wikimedia.org/r/198116 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [18:52:36] <grrrit-wm> (03PS1) 10Ori.livneh: asset-check: use HTTPS urls; use TLSv1 [puppet] - 10https://gerrit.wikimedia.org/r/226598 (https://phabricator.wikimedia.org/T105354) [18:52:56] <grrrit-wm> (03CR) 10Ori.livneh: [C: 032 V: 032] asset-check: use HTTPS urls; use TLSv1 [puppet] - 10https://gerrit.wikimedia.org/r/226598 (https://phabricator.wikimedia.org/T105354) (owner: 10Ori.livneh) [18:55:43] <wikibugs> 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1476215 (10RobH) a:3mark We have 5 of the Dell PowerEdge R420, single Intel Xeon E5-2450 v2 2.50GHz, 16GB Memory, (2) 500GB Disks and 2 of the PowerEdge R420, Dual Intel X... [18:57:46] <wikibugs> 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1476222 (10yuvipanda) [19:00:13] <icinga-wm> RECOVERY - check_puppetrun on saiph is OK Puppet is currently enabled, last run -135 seconds ago with 0 failures [19:00:13] <icinga-wm> RECOVERY - check_raid on payments2002 is OK HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 OK] [19:00:14] <grrrit-wm> (03PS3) 10Tim Landscheidt: Ignore warnings about URLs without modules for private repository [puppet] - 10https://gerrit.wikimedia.org/r/198116 (https://phabricator.wikimedia.org/T87132) [19:01:36] <grrrit-wm> (03PS16) 10Eevans: Cassandra logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) [19:05:48] <grrrit-wm> (03CR) 10Yuvipanda: Labs: Script to back labstore filesystems up (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/224064 (https://phabricator.wikimedia.org/T105027) (owner: 10coren) [19:06:53] <grrrit-wm> (03PS3) 10Alex Monk: Config for replacement of wgVisualEditorNamespaces with an associative array 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/226414 [19:10:13] <icinga-wm> PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [19:11:28] <YuviPanda> Coren: I responded to the responses with some minor suggestions :) [19:12:30] <Coren> I saw. I don't mind the long options, though I think the reason you want them is misguided. :-) Otherwise, I'll push the new version as soon as I figure out the systemd units. [19:12:33] <grrrit-wm> (03CR) 10Tim Landscheidt: "Tested with:" [puppet] - 10https://gerrit.wikimedia.org/r/198116 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [19:14:32] <aude> RoanKattouw: thcipriani what's this about gerrit doing submodule updates? (i think i heard something about it...) [19:15:06] <RoanKattouw> aude: Gerrit auto-generates the submodule updates now for repos that aren't named VisualEditor [19:15:11] <grrrit-wm> (03PS2) 10Alex Monk: Add HTTPS variants for RSS feed whitelists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222691 (https://phabricator.wikimedia.org/T104727) (owner: 10Jeremyb) [19:15:13] <icinga-wm> RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 135 seconds ago with 0 failures [19:15:15] <aude> oh [19:17:20] <grrrit-wm> (03CR) 10Legoktm: "Caused T106311" [dns] - 10https://gerrit.wikimedia.org/r/218904 (owner: 10BBlack) [19:25:50] <grrrit-wm> (03PS1) 10Catrope: Enable Flow on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226608 [19:28:06] <Krenair> I've noticed issues with the deployment calendar [19:28:21] <Krenair> It fails to handle the user's timezone properly [19:28:26] <Krenair> Uncaught TypeError: Cannot read property '1' of null [19:28:55] <Krenair> https://wikitech.wikimedia.org/wiki/MediaWiki:Common.js - in that first sfTz set [19:29:42] <Krenair> $(el).siblings('.deploycal-time-sf').children() returns multiple elements, two of which are <time datetime="something">, but .attr('datetime') returns undefined [19:31:18] 
<Krenair> I wonder if https://wikitech.wikimedia.org/w/index.php?title=MediaWiki:Common.js&diff=next&oldid=169991 broke it [19:31:26] <wikibugs> 6operations, 10Wikimedia-Stream: stream.wikimedia.org: Uneven distribution of client connections on backends - https://phabricator.wikimedia.org/T69957#1476299 (10Krinkle) [19:31:50] <RoanKattouw> Krenair: Seems likely [19:32:01] <RoanKattouw> Krinkle: You stand accused of breaking the deployment calendar timezone conversion JS ---^^ [19:32:08] <grrrit-wm> (03CR) 10Jforrester: [C: 031] Config for replacement of wgVisualEditorNamespaces with an associative array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226414 (owner: 10Alex Monk) [19:32:55] <Krenair> hah [19:32:59] <RoanKattouw> Krinkle: ...and of causing a JS error on every load of the deployments page, breaking VE on that page [19:33:09] <James_F> Krinkle: J'accuse! ;-) [19:33:39] <RoanKattouw> "Cannot read property '1' of null" [19:34:02] <Krinkle> RoanKattouw: Why would VE break. Did we regress again from loading after done/fail instead of only after done? [19:34:03] <ottomata> paravoid: yt? [19:34:16] <Krinkle> Looking into js now [19:34:22] <RoanKattouw> Krinkle: That never worked, because site is globalEval-ed [19:34:45] <RoanKattouw> Krenair correctly flags that that .siblings() call returns multiple elements [19:34:59] <Krinkle> RoanKattouw: It's not, it's a script tag, which means separate call stack and VE continues fine [19:35:01] <RoanKattouw> Most have the datetime attribute, but the first one is a <small> that doesn't, so .attr( 'datetime' ) correctly returns null [19:35:06] <RoanKattouw> Hmm [19:35:23] <RoanKattouw> Krinkle: Oh then I think what I remember is that we use .always() on a promise that is never rejected [19:35:55] <RoanKattouw> It's impossible for the state of 'site' to be set to 'error'. 
If there is a JS error in 'site', state( 'ready' ) will never be reached and the state will be 'loading' forever [19:38:41] <Krenair> I think my edit fixed it [19:39:53] <RoanKattouw> Yeah, probably [19:40:17] <RoanKattouw> Yup [19:40:18] <RoanKattouw> Thanks Krenair [19:40:26] <Krinkle> yeah, children() included a <br> [19:40:31] <Krinkle> it needed the eq(1) [19:40:33] <Krinkle> thanks [19:40:50] <Krinkle> RoanKattouw: Is there a bug for site staying unresolved in case of error? [19:41:00] <RoanKattouw> Hmm, not sure [19:41:03] <Krinkle> seems like we can wait for onload, and if state() isn't called, assume error [19:41:33] <Krinkle> In debug mode we use the same to always assume success for each file [19:41:53] <wikibugs> 6operations, 10Wikimedia-Site-requests, 5Patch-For-Review: Extension RSS fails to connect to feeds - https://phabricator.wikimedia.org/T90513#1476347 (10Krenair) I'll note them here anyway. The ones broken due to redirects are WMUA's blog feed, which https://gerrit.wikimedia.org/r/#/c/222691/ will fix, and M... [19:45:10] <RoanKattouw> I can't find a bug, I'll file one in a minute [19:52:01] <grrrit-wm> (03PS2) 10Matanya: access: New production ssh key for awight [puppet] - 10https://gerrit.wikimedia.org/r/226488 [19:56:24] <wikibugs> 10Ops-Access-Requests, 6operations, 6Reading-Admin: Requesting access to stat1002 (Hadoop / HDFS / Hue) for tbayer - https://phabricator.wikimedia.org/T105748#1476386 (10Tbayer) Pasting manager approval below: From: Terence Gilbey <tgilbey@wikimedia.org> Date: July 23, 2015 at 07:48:26 PDT To: Toby Negrin... 
[19:57:04] <wikibugs> 6operations: Migrate access-requests@ from RT to Phabricator - https://phabricator.wikimedia.org/T84861#1476388 (10Aklapper) a:5Aklapper>3None [19:57:31] <wikibugs> 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1476390 (10RobH) a:5RobH>3Andrew [20:00:53] <icinga-wm> PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 121 failures [20:03:05] <AaronSchulz> webVideoTranscode: 0 queued; 70247 claimed (27864 active, 42383 abandoned); 0 delayed [20:03:08] <AaronSchulz> brion: that looks sad :/ [20:04:30] <RoanKattouw> Krinkle: I've filed https://phabricator.wikimedia.org/T106736 [20:04:45] <grrrit-wm> (03PS1) 10Matanya: access: grant Tilman Bayer access the Analytics data [puppet] - 10https://gerrit.wikimedia.org/r/226615 [20:09:44] <legoktm> matanya: ok, so it looks like we need to split the video? [20:11:11] <matanya> legoktm: i fear so [20:11:29] <matanya> it is already spiltted from the original [20:11:44] <matanya> it was 16.5 GB [20:12:43] <legoktm> :| [20:13:32] <wikibugs> 6operations: Conftool and etcd should represent boolean values as booleans, not 'yes' / 'no' - https://phabricator.wikimedia.org/T106738#1476443 (10ori) 3NEW a:3Joe [20:13:45] <legoktm> matanya: is splitting...easy? [20:15:43] <matanya> legoktm: yes, ffmpeg INFILENAME -ss STARTTIME -t DURATION -acodec copy -vcodec copy OUTFILENAME [20:16:08] <matanya> but it is annoying to have a talk broken into two [20:16:51] <grrrit-wm> (03PS17) 10BryanDavis: Cassandra logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) (owner: 10Eevans) [20:17:03] <bd808> urandom: ^ [20:17:10] <bd808> I think that will work better [20:20:14] <greg-g> heh, VE is barfing on editing the mw:Developers/Maintainers page tables [20:20:25] <greg-g> oops, wrong chan, but I'll leave it [20:20:27] <RoanKattouw> Barfing how? 
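
For the video split matanya describes above, note that ffmpeg expects the input file to be introduced with `-i`, which the command as typed in the channel leaves out. The sketch below builds the argument list with `-i` added and otherwise the same options (`-ss`, `-t`, `-acodec copy`, `-vcodec copy`); the function names are hypothetical, and with stream copy the cut points may not land exactly on the requested times.

```python
import subprocess


def ffmpeg_split_argv(infile, start, duration, outfile):
    """Argument list that copies one segment of infile to outfile without re-encoding."""
    return [
        "ffmpeg",
        "-i", infile,        # input file (missing from the command as typed in chat)
        "-ss", start,        # start time, e.g. "00:45:00"
        "-t", duration,      # segment length, e.g. "00:45:00"
        "-acodec", "copy",   # copy the audio stream as-is
        "-vcodec", "copy",   # copy the video stream as-is
        outfile,
    ]


def split_video(infile, start, duration, outfile):
    """Run ffmpeg on the given segment; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_split_argv(infile, start, duration, outfile), check=True)
```

Calling `split_video()` twice with consecutive start times would produce the two parts; it requires an `ffmpeg` binary on the PATH.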
[20:20:33] <bd808> another reason that page should die ;) [20:21:16] <greg-g> RoanKattouw: bafing is a lazy word, symptoms: Fx asking me whether I want to kill a script, twice now. Finally loaded [20:21:35] <greg-g> bd808: but, lists! tables! relationships! categories! it has all the things we love! [20:21:38] <RoanKattouw> lol [20:22:10] <bd808> you mean all the things that are created once and then rot forever... [20:22:30] <greg-g> I see enough HotCat changes to not believe that [20:26:54] <icinga-wm> RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [20:33:34] <wikibugs> 6operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate, and 3 others: Publishing translations for central notice banners fails - https://phabricator.wikimedia.org/T104774#1476517 (10cwdent) a:5AndyRussG>3cwdent [20:33:44] <twentyafterfour> !log deployed hotfix for T106716, restarted apache on iridium [20:33:51] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:35:48] <grrrit-wm> (03CR) 10Awight: [C: 031] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/226488 (owner: 10Matanya) [20:38:52] <grrrit-wm> (03PS1) 10Ori.livneh: navtiming.py: add firstPaint metric [puppet] - 10https://gerrit.wikimedia.org/r/226621 [20:39:03] <grrrit-wm> (03CR) 10Ori.livneh: [C: 032 V: 032] navtiming.py: add firstPaint metric [puppet] - 10https://gerrit.wikimedia.org/r/226621 (owner: 10Ori.livneh) [20:44:38] <grrrit-wm> (03CR) 10Eevans: [C: 031] Cassandra logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) (owner: 10Eevans) [20:45:45] <grrrit-wm> (03CR) 10BryanDavis: [C: 031] "Logstash output looks pretty good in beta cluster now." 
[puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) (owner: 10Eevans) [21:02:17] <bd808> greg-g: fyi -- https://wikitech.wikimedia.org/wiki/Deployments#Friday.2C.C2.A0July.C2.A024 -- logstash upgrades tomorrow [21:02:53] <greg-g> bd808: whoa, going to take all day? or just safety? [21:03:20] <bd808> If it only takes 7 hrs that will be the fastest one yet [21:03:48] <bd808> I'm hoping for much faster but no real idea [21:04:06] <bd808> 1.6.0 is supposed to have fast recovery magic if I do the right things [21:04:26] <bd808> the last one took 12 hours I think [21:04:34] <greg-g> jeebus [21:04:36] <bd808> one before that took 23 [21:04:42] <greg-g> I'm fine with it, but you going to be ok? [21:04:46] <bd808> oh yeah [21:05:05] <bd808> it's mostly doing other things and checking back every 30 mins or so [21:05:23] <bd808> the cirrus cluster takes days [21:05:27] <greg-g> yeah, is there a backup to you? [21:05:42] <bd808> nope [21:05:44] <hashar> yeah stackoverflow *grin* [21:07:08] <hashar> bd808: more seriously, I noticed wmflabs has a logstash project albeit empty. Was it meant to offer a common logstash service for labs projects? [21:07:24] <hashar> bd808: or is that just your playground area? (I am curious) [21:07:45] <bd808> it was where I tested before getting beta cluster set up. 
I killed all the instances a while ago [21:07:59] <bd808> I'll be making new stuff there today though [21:08:14] <bd808> I need to change all the things to upgrade logstash to 1.5.3 [21:08:54] <bd808> I think I'm going to setup 2 vms there to log irc channels [21:09:11] <bd808> and then make a tool to show a better SAL than wikitech :) [21:09:55] <bd808> YuviPanda and I have evil plans for a labs wide logstash system too but haven't worked on it for real yet [21:10:35] <bd808> My POC for tailing logstash from the command line is part of it -- https://github.com/bd808/ggml [21:11:17] <bd808> next bit is an authn/z proxy in go to put in front of elasticsearch [21:12:04] <icinga-wm> PROBLEM - puppet last run on cp4004 is CRITICAL puppet fail [21:12:14] <greg-g> "and then make a tool to show a better SAL than wikitech :) [21:12:18] <greg-g> " ++++ [21:12:46] <hashar> bd808: ah taking over SAL would be very nice [21:12:50] <bd808> greg-g: https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/SAL [21:13:13] <hashar> and if one day you want to fill disk quickly, we could send Jenkins console logs to logstash :D [21:14:24] <Negative24> whats wrong with the current SAL [21:14:38] <hashar> Negative24: what is good with the current one ? :D [21:14:49] <Negative24> it does its job [21:14:57] <hashar> it is a huge wiki bloat that we manually archive [21:15:01] <bd808> hard to search and doesn't auto archive are the two complaints I hear most [21:15:16] <Negative24> good point [21:15:35] <hashar> bd808: do you have an irc bot listening for !log entries so ? [21:15:41] <bd808> yes [21:15:54] <icinga-wm> PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [21:15:59] <greg-g> bd808: I love you. [21:16:19] <hashar> well [21:16:28] <greg-g> :) [21:16:32] <hashar> ANNOUNCE IT !!!!!!!!!!! 
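
The idea of a bot picking `!log` entries out of channel traffic, as hashar asks about above, can be illustrated with a small parser over lines in the format this transcript uses (`[HH:MM:SS] <nick> message`). This is only a sketch of the matching step with hypothetical names; it is not the Logstash setup bd808 runs, and it makes no attempt to store or index the entries.

```python
import re

# Matches "[HH:MM:SS] <nick> !log some message".
LOG_LINE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] <([^>]+)> !log\s+(.+)$")


def parse_log_entry(line):
    """Return a dict for a !log line, or None if the line is ordinary chat."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    time, nick, message = match.groups()
    return {"time": time, "nick": nick, "message": message}
```

Any message that begins with `!log` is treated as an entry, including ones typed by accident, so a real tool would probably want some way to retract or filter entries.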
[21:16:33] <bd808> I've been logging them in beta for almost a year [21:16:38] * hashar bookmarks [21:16:51] <bd808> but the index there only keeps 30 days [21:17:04] <bd808> so the new project needs to hold much more [21:17:41] <bd808> logstash-deploym: say hello to everyone :) [21:18:00] <bd808> it's a shy bot [21:18:27] <hashar> potentially you could have them written to a different index with an infinite retention time can't you ? [21:18:30] <bd808> I actually want to log not just !logs but full channels [21:18:33] <hashar> logstash-deploym: i love you [21:18:43] <grrrit-wm> (03PS1) 10Awight: Try pointing to a /beacon URL rather than Special:RecordImpression [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226642 (https://phabricator.wikimedia.org/T106624) [21:18:56] <bd808> hashar: yes. it's just configuration. [21:19:05] <hashar> !log is already a nice improvement [21:19:12] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:19:17] <hashar> bah [21:19:20] <hashar> bd808: kudos :) [21:19:47] <hashar> with that awesome news, I can no sleep and have a nice logging dream [21:20:11] <awight> Can anyone kick this for me? https://gerrit.wikimedia.org/r/#/c/226488/ [21:21:06] <wikibugs> 6operations, 10Wikimedia-Logstash: Update Elasticsearch on logstash* to elasticsearch-1.7.0.deb - https://phabricator.wikimedia.org/T106126#1476760 (10bd808) a:3bd808 Scheduled to start 2015-07-24T16:00Z [21:21:24] <wikibugs> 6operations, 10Wikimedia-Logstash, 15User-Bd808-Test: Update Elasticsearch on logstash* to elasticsearch-1.7.0.deb - https://phabricator.wikimedia.org/T106126#1476763 (10bd808) [21:27:19] <Negative24> bd808: do you have the src for logstash-deploym published? [21:27:59] <bd808> Negative24: its the built in irc input for logstash. 
-- http://logstash.net/docs/1.4.2/inputs/irc [21:28:23] <Negative24> bd808: ok cool [21:28:47] <bd808> Negative24: plus this config from ops/puppet -- https://github.com/wikimedia/operations-puppet/blob/production/files/logstash/filter-irc-banglog.conf [21:32:52] <wikibugs> 6operations: Add tmux to maps (or other) servers - https://phabricator.wikimedia.org/T106191#1476834 (10Yurik) Note: tmux is already installed on maps-test2001, but not on others [21:33:11] <wikibugs> 6operations: Add tmux to maps (or other) servers - https://phabricator.wikimedia.org/T106191#1476835 (10Yurik) a:5akosiaris>3None [21:35:06] <YuviPanda> Coren: I see you've marked things as 'done' is there a new patchset gerrit missed? [21:35:49] <Coren> YuviPanda: No, there will be one soon though now that I think I've got my head around systemd units. I think. [21:36:00] <YuviPanda> ah cool :) [21:36:14] <Coren> YuviPanda: I can push a partial changeset w/ the script changes if you want but no units yet [21:36:22] <YuviPanda> Coren: that'll be good too! [21:37:09] <Coren> Gimme a few then [21:38:04] <icinga-wm> RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:43:47] <grrrit-wm> (03PS6) 10coren: Labs: Script to back labstore filesystems up [puppet] - 10https://gerrit.wikimedia.org/r/224064 (https://phabricator.wikimedia.org/T105027) [21:43:54] <Coren> YuviPanda: ^^ w/out systemd units [21:46:45] <YuviPanda> Coren: looking [21:47:23] <arlolra> can someone "kick gitblit" for me (hashar's words) ... php is failing here https://gerrit.wikimedia.org/r/#/c/226119/ [21:47:48] <grrrit-wm> (03CR) 10Eevans: [C: 031] "LGTM, be seem my comment about integer validation." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/226459 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [21:48:10] <arlolra> maybe godog ^ [21:54:46] <wikibugs> 10Ops-Access-Requests, 6operations, 5Patch-For-Review: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1476955 (10TJones) @fgiunchedi — When should I expect this to be live? I'm trying to figure out if my inability to connect is coming from misconfiguration on my end, or because t... [22:01:26] <grrrit-wm> (03PS4) 1020after4: Check for l10n cache before sync-wikiversions [tools/scap] - 10https://gerrit.wikimedia.org/r/226353 (https://phabricator.wikimedia.org/T100573) [22:03:06] <grrrit-wm> (03CR) 1020after4: "ok now it passes the 3rd argument" [tools/scap] - 10https://gerrit.wikimedia.org/r/226353 (https://phabricator.wikimedia.org/T100573) (owner: 1020after4) [22:03:48] <grrrit-wm> (03CR) 1020after4: Check for l10n cache before sync-wikiversions (032 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/226353 (https://phabricator.wikimedia.org/T100573) (owner: 1020after4) [22:07:36] <icinga-wm> PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 8.33% of data above the critical threshold [500.0] [22:11:26] <grrrit-wm> (03PS2) 10Gergő Tisza: Fix graphite keys in API dashboard [puppet] - 10https://gerrit.wikimedia.org/r/223659 (https://phabricator.wikimedia.org/T85841) [22:13:15] <icinga-wm> PROBLEM - Disk space on hafnium is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=71%) [22:14:26] <ori> hafnium is me, fixing [22:16:19] <wikibugs> 10Ops-Access-Requests, 6operations, 5Patch-For-Review: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1477033 (10Matanya) This is live you should be able to access unless my patch is broken somehow. 
[22:17:15] <icinga-wm> RECOVERY - Disk space on hafnium is OK: DISK OK [22:17:34] <icinga-wm> RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [22:18:11] <spagewmf> https://git.wikimedia.org eventually times out with 503 , known issue? [22:18:46] <bd808> spagewmf: arlolra reported that too. [22:18:50] <matanya> Trey314159: an educated guess: you didn't specify with which key to log in ? [22:19:02] <bd808> any root up for the restart gitblit dance? [22:19:26] <matanya> bd808: just use phab :) [22:19:32] <bd808> spagewmf: https://phabricator.wikimedia.org/diffusion/ [22:19:46] <grrrit-wm> (03PS3) 10Gergő Tisza: [WIP] Update apache rules for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/225552 [22:19:51] <grrrit-wm> (03CR) 10jenkins-bot: [V: 04-1] [WIP] Update apache rules for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza) [22:19:54] <bd808> matanya: yes. but we need to kill gitblit or keep it up until then [22:19:57] <spagewmf> bd808: thx. Yup, https://phabricator.wikimedia.org/T101358 is to update {{git file}}. [22:20:03] <grrrit-wm> (03PS4) 10Gergő Tisza: [WIP] Update apache rules for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/225552 [22:20:10] <grrrit-wm> (03CR) 10jenkins-bot: [V: 04-1] [WIP] Update apache rules for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza) [22:21:17] <matanya> bd808: totally. poke @ ostriches [22:21:36] <bd808> I know the answer: working on it [22:21:48] <grrrit-wm> (03PS1) 10Ori.livneh: Add role::cache::kafka::banner [puppet] - 10https://gerrit.wikimedia.org/r/226646 [22:21:59] <ori> awight: ^^ [22:23:24] <wikibugs> 6operations: tin doesn't have access to same memcached as terbium and app servers - https://phabricator.wikimedia.org/T103198#1477068 (10Mattflaschen) If tin is not meant to be able to do MW-y things, `mwscript` should be disabled there. That would solve a lot of the confusion. 
[22:24:38] <wikibugs> 6operations, 6Labs, 6Multimedia, 10wikitech.wikimedia.org, and 2 others: Some wikitech.wikimedia.org thumbnails broken (404) - https://phabricator.wikimedia.org/T93041#1477077 (10Bawolff) PDFs I would guess is missing the pdfinfo command line tool (popler-utils package i think). [This is a guess though] S... [22:25:07] <grrrit-wm> (03CR) 10Ori.livneh: "@Awight: Please take a look at line 9 of modules/role/manifests/cache/kafka/banner.pp, which specifies the format string to use for genera" [puppet] - 10https://gerrit.wikimedia.org/r/226646 (owner: 10Ori.livneh) [22:29:13] <wikibugs> 6operations, 6Labs, 6Multimedia, 10wikitech.wikimedia.org, and 2 others: Some wikitech.wikimedia.org thumbnails broken (404) - https://phabricator.wikimedia.org/T93041#1477112 (10Bawolff) >>! In T93041#1371054, @Krenair wrote: > Ugh, why another wikitech-specific hack? What about @bawolff's comment? There... [22:29:19] <wikibugs> 6operations, 6Labs, 6Multimedia, 10wikitech.wikimedia.org, and 2 others: Some wikitech.wikimedia.org thumbnails broken (404) - https://phabricator.wikimedia.org/T93041#1477113 (10Krenair) >>! In T93041#1477077, @Bawolff wrote: > PDFs I would guess is missing the pdfinfo command line tool (popler-utils pack... [22:36:20] <wikibugs> 6operations, 6Labs, 6Multimedia, 10wikitech.wikimedia.org, and 2 others: Some wikitech.wikimedia.org thumbnails broken (404) - https://phabricator.wikimedia.org/T93041#1477185 (10Krenair) >>! In T93041#1477077, @Bawolff wrote: > I believe that --no-external-files is a flag that comes from a WMF specific pa... [22:50:25] <wikibugs> 6operations, 6Security: can we get rid of rsvg security patch? 
- https://phabricator.wikimedia.org/T104147#1477261 (10Krenair) [22:50:32] <wikibugs> 6operations, 6Labs, 6Multimedia, 10wikitech.wikimedia.org, and 2 others: Some wikitech.wikimedia.org thumbnails broken (404) - https://phabricator.wikimedia.org/T93041#1477263 (10Krenair) See also T104147, T80392 [22:54:58] <grrrit-wm> (03CR) 10Eevans: "s/be seem my/but seem my/ (sheesh)" [puppet] - 10https://gerrit.wikimedia.org/r/226459 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [22:56:11] <grrrit-wm> (03PS1) 10Ori.livneh: asset-check: compute payload size correctly [puppet] - 10https://gerrit.wikimedia.org/r/226648 [22:56:18] <wikibugs> 7Blocked-on-Operations, 6operations, 6Commons, 6Multimedia, and 5 others: Convert eqiad imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1477292 (10Legoktm) >>! In T84842#1473663, @Krenair wrote: > ```2015-07-23 00:06:38 mw1153 commonswiki exception ERROR: [d68c4bb1] /w/thumb_handler... [22:58:36] <grrrit-wm> (03CR) 10Ori.livneh: [C: 032] "Timo, FYI." [puppet] - 10https://gerrit.wikimedia.org/r/226648 (owner: 10Ori.livneh) [22:59:16] <wikibugs> 7Blocked-on-Operations, 6operations, 6Commons, 6Multimedia, and 5 others: Convert eqiad imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1477299 (10bd808) >>! In T84842#1473663, @Krenair wrote: > ```2015-07-23 00:06:38 mw1153 commonswiki exception ERROR: [d68c4bb1] /w/thumb_handler.p... [23:00:05] <jouncebot> RoanKattouw ostriches rmoen Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150723T2300). [23:00:05] <jouncebot> RoanKattouw ebernhardson Krenair MaxSem: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[23:00:14] <RoanKattouw> Alrighty, SWAT time [23:00:15] * MaxSem is here [23:00:43] <RoanKattouw> And for that MaxSem gets a cookie [23:00:47] <RoanKattouw> i.e. to have his change deployed first [23:00:51] * ebernhardson is also here, somehow :) [23:01:03] <grrrit-wm> (03CR) 10Catrope: [C: 032] Enable geo features tracking everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226253 (owner: 10MaxSem) [23:01:08] <MaxSem> yummy! [23:01:12] * rmoen rmoen is also here but it looks like RoanKattouw will do today :) [23:01:21] <Krenair> MaxSem also added the 9th patch after everybody else :p [23:01:31] <grrrit-wm> (03Merged) 10jenkins-bot: Enable geo features tracking everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226253 (owner: 10MaxSem) [23:01:35] <RoanKattouw> Yeah I usually do it on days my own patches are in there [23:01:46] <rmoen> Understood [23:02:13] <Krenair> (swat is *supposed* to be max 8 patches, but I don't think people actually care at this point) [23:02:27] <RoanKattouw> meh [23:02:30] <ebernhardson> all depends how long it takes to deploy [23:02:37] <logmsgbot> !log catrope Synchronized wmf-config/InitialiseSettings.php: Enable geo feature usage tracking on all wikis (duration: 00m 12s) [23:02:37] <RoanKattouw> Do separate wmf14 and wmf15 cherry-picks really count? [23:02:42] <ebernhardson> the 8 way added because one day i deployed something like 12 patches and swat ran 45 minutes over [23:02:44] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:02:50] <greg-g> wow, holy cow, that's a lot of last minute patches, not 1.5 hours ago it was empty :) [23:02:50] <brion> AaronSchulz: that does seem like a lot of abandoned items; might take a look to see if the code path for dupes & unneeded files is properly marked... [23:03:08] <RoanKattouw> greg-g: No it wasn't, I put my patch in there before lunch didn't I? [23:03:09] <Krenair> I was prompted to add one. 
[23:03:10] <RoanKattouw> Or am I crazy?
[23:03:12] <Krenair> Therefore I added three.
[23:03:18] * Krenair denies everything
[23:03:26] <greg-g> RoanKattouw: or I haven't reloaded that tab for longer than I thought :)
[23:03:33] <ebernhardson> greg-g: it couldn't have been, because i put mine in there this morning :P
[23:03:49] <greg-g> I... was mistaken then good sirs
[23:04:08] <RoanKattouw> And mine was the first one in there, before ebernhardson's
[23:04:12] <RoanKattouw> MaxSem: Your patch is live
[23:04:29] <MaxSem> thanks, looking
[23:05:08] <RoanKattouw> Krenair: Re https://gerrit.wikimedia.org/r/#/c/226414
[23:05:24] <RoanKattouw> Oh wait
[23:05:28] <RoanKattouw> *Available*Namespaces
[23:05:42] <Krenair> yep.
[23:05:46] <RoanKattouw> OK that's fine then
[23:05:51] <grrrit-wm> (CR) Catrope: [C: 2] Config for replacement of wgVisualEditorNamespaces with an associative array [mediawiki-config] - https://gerrit.wikimedia.org/r/226414 (owner: Alex Monk)
[23:05:57] <grrrit-wm> (Merged) jenkins-bot: Config for replacement of wgVisualEditorNamespaces with an associative array [mediawiki-config] - https://gerrit.wikimedia.org/r/226414 (owner: Alex Monk)
[23:05:57] <Krenair> should be a no-op
[23:06:03] <icinga-wm> RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61453 bytes in 0.062 second response time
[23:07:38] <grrrit-wm> (PS2) Yuvipanda: Tools: Remove grid host alias for deleted instance [puppet] - https://gerrit.wikimedia.org/r/226546 (https://phabricator.wikimedia.org/T104919) (owner: Tim Landscheidt)
[23:07:44] <grrrit-wm> (CR) Yuvipanda: [C: 2 V: 2] Tools: Remove grid host alias for deleted instance [puppet] - https://gerrit.wikimedia.org/r/226546 (https://phabricator.wikimedia.org/T104919) (owner: Tim Landscheidt)
[23:09:16] <YuviPanda> Coren: looks mostly ok to me - one minor nit in earlier patchset (should be ContextError or something, rather than very broad 'RuntimeError')
[23:09:48]
<YuviPanda> Coren: if the systemd units are taking time that's ok - let's get this merged and run it manually once or twice, and I / others can help with the systemd units.
[23:09:49] <ori> !log T84842: Requests to thumb_handler.php/.* don't match the ProxyPass rule and get handled by Zend instead. To see how HHVM actually handles these requests, I'm disabling Puppet on mw1153 and dropping the '$' anchor from the ProxyPass rules.
[23:09:55] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:10:13] <grrrit-wm> (CR) Catrope: [C: 2] Get rid of default=wikipedia assumptions in config [mediawiki-config] - https://gerrit.wikimedia.org/r/225287 (https://phabricator.wikimedia.org/T104088) (owner: Alex Monk)
[23:10:25] <Krenair> did you sync the first one RoanKattouw?
[23:10:35] <RoanKattouw> Not yet
[23:10:36] <Krenair> doing them all at once?
[23:10:41] <RoanKattouw> I was hoping to just +2 them all in a row
[23:10:41] <grrrit-wm> (Merged) jenkins-bot: Get rid of default=wikipedia assumptions in config [mediawiki-config] - https://gerrit.wikimedia.org/r/225287 (https://phabricator.wikimedia.org/T104088) (owner: Alex Monk)
[23:10:42] <Krenair> ok
[23:10:44] <RoanKattouw> But then that one was complicated
[23:11:49] <ori> !log Restarting Apache on mw1153
[23:11:54] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:12:05] <grrrit-wm> (CR) Catrope: [C: 2] Set ilwikimedia, noboard_chapterswikimedia and arbcom_dewiki's logos to default [mediawiki-config] - https://gerrit.wikimedia.org/r/226471 (owner: Alex Monk)
[23:12:11] <grrrit-wm> (Merged) jenkins-bot: Set ilwikimedia, noboard_chapterswikimedia and arbcom_dewiki's logos to default [mediawiki-config] - https://gerrit.wikimedia.org/r/226471 (owner: Alex Monk)
[23:12:13] <grrrit-wm> (CR) Catrope: [C: 2] Enable Flow on viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/226608 (owner:
Catrope)
[23:12:37] <grrrit-wm> (Merged) jenkins-bot: Enable Flow on viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/226608 (owner: Catrope)
[23:14:11] <logmsgbot> !log catrope Synchronized w/static/images/: SWAT (duration: 00m 12s)
[23:14:18] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:14:48] <logmsgbot> !log catrope Synchronized wmf-config/: SWAT (duration: 00m 11s)
[23:14:54] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:16:08] <logmsgbot> !log catrope Synchronized flow.dblist: Enable Flow on viwiki (duration: 00m 12s)
[23:16:14] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:17:00] <Krenair> Looks ok..
[23:19:32] <grrrit-wm> (CR) Tim Landscheidt: cassandra: Fix strict puppet-lint warnings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/226459 (https://phabricator.wikimedia.org/T87132) (owner: Tim Landscheidt)
[23:20:56] <grrrit-wm> (PS2) Yuvipanda: cassandra: Fix strict puppet-lint warnings [puppet] - https://gerrit.wikimedia.org/r/226459 (https://phabricator.wikimedia.org/T87132) (owner: Tim Landscheidt)
[23:21:04] <grrrit-wm> (CR) Yuvipanda: [C: 2 V: 2] cassandra: Fix strict puppet-lint warnings [puppet] - https://gerrit.wikimedia.org/r/226459 (https://phabricator.wikimedia.org/T87132) (owner: Tim Landscheidt)
[23:27:38] <grrrit-wm> (PS3) Niedzielski: WIP: Add Android emulation prerequisites [puppet] - https://gerrit.wikimedia.org/r/226237 (https://phabricator.wikimedia.org/T62720)
[23:27:59] <wikibugs> operations, Varnish: Figure out purging of static logos for updates - https://phabricator.wikimedia.org/T106620#1477378 (Krenair) Same issue for default.png now
[23:28:26] <grrrit-wm> (CR) Niedzielski: "Rebased. Thanks! Will do!"
[puppet] - https://gerrit.wikimedia.org/r/226237 (https://phabricator.wikimedia.org/T62720) (owner: Niedzielski)
[23:30:30] <niedzielski> Hey all! I don't know anything about Puppet but i have a patch for the Android Jenkins instance that we believe is good to go. Who would be a good POC to add to the Gerrit patch review?
[23:30:40] <logmsgbot> !log catrope Synchronized php-1.26wmf14/extensions/CirrusSearch: SWAT (duration: 00m 13s)
[23:30:48] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:30:53] <logmsgbot> !log catrope Synchronized php-1.26wmf14/extensions/WikimediaEvents: SWAT (duration: 00m 12s)
[23:31:00] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:31:06] <logmsgbot> !log catrope Synchronized php-1.26wmf15/extensions/CirrusSearch: SWAT (duration: 00m 12s)
[23:31:11] <niedzielski> Oh, and for reference, here's the patch https://gerrit.wikimedia.org/r/#/c/226237/.
[23:31:13] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:31:14] <icinga-wm> PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 13 failures
[23:31:18] <logmsgbot> !log catrope Synchronized php-1.26wmf15/extensions/WikimediaEvents: SWAT (duration: 00m 12s)
[23:31:24] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:31:46] <Krenair> niedzielski, you want someone with puppet, android, and jenkins expertise? :/
[23:31:57] <niedzielski> Krenair: well, we verified the android bits
[23:32:10] <ebernhardson> RoanKattouw: executor patch looks fine
[23:32:13] <niedzielski> Krenair: we think the jenkins bits are good to go too
[23:32:16] <ebernhardson> looking at EL now for the other
[23:32:43] <Krenair> so really you want someone from ops?
[23:33:00] <Krenair> who can actually merge the puppet patch into operations/puppet?
[23:33:05] <ebernhardson> Krenair: ops
[23:33:14] <niedzielski> Krenair: did i come to the right place? i thought ops was short for operations
[23:33:14] <icinga-wm> PROBLEM - puppet last run on praseodymium is CRITICAL puppet fail
[23:33:17] <ebernhardson> oh, wasn't a question :)
[23:33:37] <Krenair> niedzielski, yep!
[23:34:22] <YuviPanda> niedzielski: Krenair I'm happy merging it, but am curious wtf gerrit did with PS 1 and 2 in https://gerrit.wikimedia.org/r/#/c/226237/3
[23:34:30] <Krenair> niedzielski, if you had gone to -ops, then you'd be in the wrong place, confusingly enough :)
[23:34:33] <YuviPanda> it starts at PS3...
[23:34:46] <Krenair> YuviPanda, probably gerrit drafts
[23:34:48] <niedzielski> YuviPanda: first two patches were drafts
[23:34:52] <YuviPanda> ah, I see
[23:34:55] <icinga-wm> PROBLEM - puppet last run on db2051 is CRITICAL puppet fail
[23:35:05] <YuviPanda> niedzielski: so what was > Rebased. Thanks! Will do!
[23:35:07] <YuviPanda> in response to?
[23:35:14] <Krenair> you can still pull the actual contents of those drafts via several means
[23:35:28] <Krenair> I think it's well known at this point
[23:35:46] <niedzielski> YuviPanda: rebased means "rebased onto destination branch", production, in this case
[23:35:56] <niedzielski> or that's what i intended it to mean
[23:36:07] <YuviPanda> niedzielski: right, none of this is relevant, was just curious :)
[23:36:18] <YuviPanda> niedzielski: the jenkins bits are different from here anyway. ok if I merge now?
[23:36:20] <ebernhardson> RoanKattouw: looks like the EL code for wikimedia events is also working fine. thanks
[23:36:25] <icinga-wm> PROBLEM - puppet last run on cerium is CRITICAL puppet fail
[23:37:18] <niedzielski> YuviPanda: on my end, it would be great if you could merge it now.
i'm not sure if there are any other concerns as i've not had to make changes to this repo previously
[23:37:31] <grrrit-wm> (PS4) Yuvipanda: Add Android emulation prerequisites [puppet] - https://gerrit.wikimedia.org/r/226237 (https://phabricator.wikimedia.org/T62720) (owner: Niedzielski)
[23:37:48] <grrrit-wm> (CR) Yuvipanda: [C: 2 V: 2] "Congratulations!" [puppet] - https://gerrit.wikimedia.org/r/226237 (https://phabricator.wikimedia.org/T62720) (owner: Niedzielski)
[23:38:13] <YuviPanda> niedzielski: I have merged it after removing the 'WIP' tag
[23:38:20] <niedzielski> YuviPanda Krenair ebernhardson: \o\ /o/ \o/
[23:38:23] <niedzielski> thanks!
[23:38:25] <icinga-wm> PROBLEM - puppet last run on xenon is CRITICAL puppet fail
[23:38:29] * Krenair did nothing
[23:38:40] <Krenair> but you're welcome anyways :p
[23:40:16] <YuviPanda> niedzielski: :)
[23:40:34] <icinga-wm> PROBLEM - puppet last run on restbase1003 is CRITICAL puppet fail
[23:42:24] <icinga-wm> PROBLEM - puppet last run on restbase1004 is CRITICAL puppet fail
[23:43:40] <YuviPanda> restbase failures seem to be me
[23:44:58] <grrrit-wm> (PS1) Yuvipanda: Revert "cassandra: Fix strict puppet-lint warnings" [puppet] - https://gerrit.wikimedia.org/r/226652
[23:45:05] <grrrit-wm> (PS2) Yuvipanda: Revert "cassandra: Fix strict puppet-lint warnings" [puppet] - https://gerrit.wikimedia.org/r/226652
[23:45:12] <grrrit-wm> (CR) Yuvipanda: [C: 2 V: 2] Revert "cassandra: Fix strict puppet-lint warnings" [puppet] - https://gerrit.wikimedia.org/r/226652 (owner: Yuvipanda)
[23:48:49] <grrrit-wm> (Abandoned) Ori.livneh: Add role::cache::kafka::banner [puppet] - https://gerrit.wikimedia.org/r/226646 (owner: Ori.livneh)
[23:50:05] <grrrit-wm> (PS1) Ori.livneh: erbium: add a log for beacon/impression [puppet] - https://gerrit.wikimedia.org/r/226654 (https://phabricator.wikimedia.org/T106624)
[23:50:21] <grrrit-wm> (PS2) Ori.livneh: erbium: add a log for
beacon/impression [puppet] - https://gerrit.wikimedia.org/r/226654 (https://phabricator.wikimedia.org/T106624)
[23:50:33] <grrrit-wm> (CR) Ori.livneh: [C: 2 V: 2] erbium: add a log for beacon/impression [puppet] - https://gerrit.wikimedia.org/r/226654 (https://phabricator.wikimedia.org/T106624) (owner: Ori.livneh)
[23:50:34] <icinga-wm> RECOVERY - puppet last run on restbase1003 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures
[23:57:04] <icinga-wm> RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:59:14] <icinga-wm> RECOVERY - puppet last run on praseodymium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures