[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160526T0000). [00:01:26] Completed at 2016-05-25 23:59:29+00:00. Copying LC files to /srv/mediawiki-staging [00:01:30] 00:01:18 Updated 392 JSON file(s) in /srv/mediawiki-staging/php-1.28.0-wmf.3/cache/l10n [00:01:36] syncing [00:07:11] (03PS3) 10Dzahn: logging: move files/misc/demux.py to modules/udp2log [puppet] - 10https://gerrit.wikimedia.org/r/289353 [00:07:20] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/2922/" [puppet] - 10https://gerrit.wikimedia.org/r/289353 (owner: 10Dzahn) [00:08:08] PROBLEM - cassandra-b CQL 10.192.48.47:9042 on restbase2005 is CRITICAL: Connection refused [00:08:43] ^^^ got it [00:09:20] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.48.47:9042 on restbase2005 is CRITICAL: Connection refused eevans Node is bootstrapping. - The acknowledgement expires at: 2016-05-28 00:08:58. [00:09:37] :) [00:15:01] (03PS1) 10Eevans: filter out new metrics [puppet] - 10https://gerrit.wikimedia.org/r/290860 [00:16:49] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2328751 (10Gilles) [00:17:35] !log dereckson@tin scap sync-l10n completed (1.28.0-wmf.3) (duration: 16m 15s) [00:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:18:08] * Dereckson sighs [00:18:34] bd808: I've got a lot of permission denied warnings like: [00:18:37] aywikibooks: [Thu May 26 00:18:26 2016] [hphp] [5591:7f2ad124f100:0:000001] [] [00:18:40] aywikibooks: Warning: rename(): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 189 [00:18:50] yeah. I think those are ok [00:18:55] k [00:19:07] We should open a bug and make sure though [00:19:28] something is running as the wrong user. 
those should all be owned by www-data [00:20:07] It may be just a quirk on tin from some other process [00:20:38] (03PS2) 10Dzahn: let chromium use jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/290347 [00:22:24] (03CR) 10Dzahn: [C: 032] let chromium use jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/290347 (owner: 10Dzahn) [00:22:36] Filed as https://phabricator.wikimedia.org/T136258 [00:22:42] 06Operations, 10Deployment-Systems, 03Scap3: Warning: rename(): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 189 - https://phabricator.wikimedia.org/T136258#2328753 (10Dereckson) [00:27:50] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu May 26 00:27:50 UTC 2016 (duration 10m 15s) [00:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:29:02] All done [00:29:03] [dereckson@tin ~]$ [00:31:39] bd808: https://commons.wikimedia.org/wiki/MediaWiki:Group-wmf-supportsafety-member: the string hasn't been imported, could it be that this process updates only *existing strings* but doesn't import new strings? [00:32:09] We expected to get new strings from https://gerrit.wikimedia.org/r/#/c/290581/ [00:32:25] Dereckson: hmmm... I don't remember honestly [00:35:38] Dereckson: I'm not sure that wmf-messages is set up to actually read those keys -- https://phabricator.wikimedia.org/diffusion/EWME/browse/master/WikimediaMessages.hooks.php [00:35:53] that extension does weird stuff [00:35:55] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure: Hiera hierarchy hieradata/role/* is not applied on labs (eg deployment-prep) - https://phabricator.wikimedia.org/T136080#2322414 (10scfc) Is this task a duplicate of T120165? 
[00:37:17] bd808: k I'm backporting it to php-1.28.0-wmf.3 [00:37:48] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure: Hiera hierarchy hieradata/role/* is not applied on labs (eg deployment-prep) - https://phabricator.wikimedia.org/T136080#2322414 (10Dzahn) yes, i think that's a duplicate. a real "merge" of the content is still being missed for these cases [00:37:53] if the extension isn't actually grabbing them in the onMessageCacheGet hook that won't help [00:38:00] have you tested on beta cluster? [00:38:31] hmmm I thought these keys were for special handling. [00:39:06] hey Krenair how did you deploy the new name for the en.wiki extendedconfirmed group? [00:39:45] Dereckson: you may be right. the message for https://commons.wikimedia.org/wiki/MediaWiki:Group-wmf-researcher-member is there [00:40:00] Oh [00:40:03] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [00:40:05] so it worked, but it was only a cache issue [00:40:07] good news [00:40:18] that is an older message in the file [00:40:28] ah [00:40:41] 06Operations, 07Puppet, 06Labs: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#1847021 (10Dzahn) https://wikitech.wikimedia.org/wiki/Puppet_Hiera#Role-based_lookup It's that "the new parser function/keyword, called role" is something that we (Joe) made ourself and do... [00:40:45] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [00:41:10] * Dereckson checks the deployments archive. 
[00:45:41] For the last similar config changes with new messages, we didn't deploy WikimediaMessages, only merged to master [00:46:27] (03PS2) 10Dzahn: snapshot: one file per role class, move to modules/role [puppet] - 10https://gerrit.wikimedia.org/r/286165 [00:48:22] https://wikitech.wikimedia.org/wiki/Deployments/Archive/2015/01#deploycal-item-20150106T0000 [00:48:33] aude did a backport [00:49:03] so apparently yes, it's the right process [00:50:27] AaronSchulz: Use correct module name for stats in executeActionWithErrorHandling() [00:51:09] AaronSchulz: this is an undeployed change on wmf3 branch [00:51:28] https://gerrit.wikimedia.org/r/#/c/290836/ [00:51:30] I was staring at the WikimediaMessages thing that was in HEAD..origin/wmf [00:52:27] !log aaron@tin Synchronized php-1.28.0-wmf.3/includes/api/ApiMain.php: 01e68e966413c (duration: 00m 29s) [00:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:53:36] 06Operations, 10Deployment-Systems, 03Scap3: Warning: rename(): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 189 - https://phabricator.wikimedia.org/T136258#2328799 (10bd808) These are temp files for the WM globals cache. The files in /tmp are being created with l10nupdate:l10nup... [00:57:51] !log dereckson@tin Synchronized php-1.28.0-wmf.3/extensions/WikimediaMessages/i18n/wikimedia: Add i18n messages for new Support and Safety group (duration: 00m 26s) [00:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:58:56] Okay now 2am l10nupdate l10n cache refresh should propagate the l10n key. [00:59:43] * foks pokes head in [01:03:02] foks: TL;DR for *existing* messages, we've a working job to propagate new changes, it seems it doesn't work for *NEW* messages, I checked in deployment/server admin log how previous similar WikimediaMessages have been handled and did the same: cherry-pick to current wmf branch, sync i18n folder. 
In 30 minutes, l10nupdate will refresh the cache according to the keys in the files. [01:03:35] foks: so once l10nupdate has finished for wmf3 (not wmf2) and reported here that it's done, you can test at https://commons.wikimedia.org/wiki/MediaWiki:Group-wmf-supportsafety [01:03:57] Ah okay [01:21:27] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2328835 (10Jdlrobson) [01:23:20] (03PS3) 10Dzahn: snapshot: one file per role class, move to modules/role [puppet] - 10https://gerrit.wikimedia.org/r/286165 [01:24:20] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup Fundraising DB - https://phabricator.wikimedia.org/T136200#2328837 (10Jgreen) [01:26:24] (03PS1) 10BryanDavis: foreachwikiindblist: Fix sudo guard and cleanup script [puppet] - 10https://gerrit.wikimedia.org/r/290863 (https://phabricator.wikimedia.org/T136258) [01:27:23] (03CR) 10jenkins-bot: [V: 04-1] foreachwikiindblist: Fix sudo guard and cleanup script [puppet] - 10https://gerrit.wikimedia.org/r/290863 (https://phabricator.wikimedia.org/T136258) (owner: 10BryanDavis) [01:28:02] (03PS2) 10BryanDavis: foreachwikiindblist: Fix sudo guard and cleanup script [puppet] - 10https://gerrit.wikimedia.org/r/290863 (https://phabricator.wikimedia.org/T136258) [01:28:45] (03PS4) 10Dzahn: snapshot: one file per role class, move to modules/role [puppet] - 10https://gerrit.wikimedia.org/r/286165 [01:29:48] (03PS5) 10Dzahn: snapshot: one file per role class, move to modules/role [puppet] - 10https://gerrit.wikimedia.org/r/286165 [01:30:18] (03CR) 10Dzahn: [C: 032] snapshot: one file per role class, move to modules/role [puppet] - 10https://gerrit.wikimedia.org/r/286165 (owner: 10Dzahn) [01:33:22] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [01:35:21] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. 
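For context on the rename() warning tracked in T136258: bd808's comment on the task says the affected files are temp files for the WM globals cache. What follows is a minimal sketch, in Python rather than the PHP of CommonSettings.php, of the general write-to-a-temp-file-then-rename pattern, including removal of the temp file when the rename fails. The function name, the "conf-" prefix and the JSON payload are hypothetical; this is not the actual wmf-config code.

```python
import json
import os
import tempfile


def write_cache_atomically(cache_path, data):
    """Write data as JSON to cache_path via a temp file plus rename.

    Returns True on success and False if the rename failed. On failure
    the temp file is removed so that leftovers do not pile up.
    (Hypothetical helper, for illustration only.)
    """
    cache_dir = os.path.dirname(cache_path) or "."
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, prefix="conf-")
    with os.fdopen(fd, "w") as handle:
        json.dump(data, handle)
    try:
        os.replace(tmp_path, cache_path)
        return True
    except OSError:
        # The rename failed (for example: permission denied); clean up the temp file.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        return False
```

A caller that gets False back can fall back to rebuilding the data in memory; the point of the sketch is only that a failed rename should not leave the temp file behind.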
[01:40:17] (03PS1) 10Dzahn: snapshot: follow-up to move role classes [puppet] - 10https://gerrit.wikimedia.org/r/290864 [01:40:32] PROBLEM - puppet last run on snapshot1003 is CRITICAL: CRITICAL: puppet fail [01:42:25] (03PS2) 10Dzahn: snapshot: follow-up to move role classes [puppet] - 10https://gerrit.wikimedia.org/r/290864 [01:45:23] (03PS3) 10Dzahn: snapshot: follow-up to move role classes [puppet] - 10https://gerrit.wikimedia.org/r/290864 [01:45:32] (03CR) 10Dzahn: [C: 032] snapshot: follow-up to move role classes [puppet] - 10https://gerrit.wikimedia.org/r/290864 (owner: 10Dzahn) [01:47:22] !log mw1249 - restart hhvm [01:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:48:01] RECOVERY - Apache HTTP on mw1249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.064 second response time [01:48:11] RECOVERY - HHVM rendering on mw1249 is OK: HTTP OK: HTTP/1.1 200 OK - 67693 bytes in 0.222 second response time [01:49:33] (03CR) 10Dzahn: "checked on every single snapshot host.. after https://gerrit.wikimedia.org/r/#/c/290864/ all is unchanged" [puppet] - 10https://gerrit.wikimedia.org/r/286165 (owner: 10Dzahn) [01:50:21] RECOVERY - puppet last run on snapshot1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:51:09] (03CR) 10Dzahn: "i don't remember the WIP part meanwhile.. 
been too long" [software] - 10https://gerrit.wikimedia.org/r/276890 (owner: 10Dzahn) [01:55:30] (03PS2) 10Dzahn: syslog: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/286164 [01:55:51] (03CR) 10jenkins-bot: [V: 04-1] syslog: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/286164 (owner: 10Dzahn) [01:56:23] (03PS3) 10Dzahn: syslog: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/286164 [01:57:55] (03CR) 10Dzahn: [C: 032] syslog: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/286164 (owner: 10Dzahn) [02:01:05] 06Operations, 13Patch-For-Review: install font packages on all appservers, not just imagescalers (was: Install fonts-wqy-zenhei on all mediawiki app servers) - https://phabricator.wikimedia.org/T84777#2328857 (10Dzahn) @joe If "Timelines aren't rendered on image scalers. They're rendered on standard mediawiki... [02:24:39] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.2) (duration: 10m 33s) [02:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:56:18] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.3) (duration: 15m 49s) [02:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:05:45] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu May 26 03:05:45 UTC 2016 (duration 9m 27s) [03:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:12:12] (03PS1) 10BryanDavis: CommonSettings: cleanup temp cache file if rename fails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290867 (https://phabricator.wikimedia.org/T136258) [04:17:20] (03PS7) 10Ori.livneh: Convert mwgrep to use regexp by default [puppet] - 10https://gerrit.wikimedia.org/r/283107 (owner: 10EBernhardson) [04:17:31] (03CR) 10Ori.livneh: [C: 032 V: 032] Convert mwgrep to use regexp by default [puppet] - 
10https://gerrit.wikimedia.org/r/283107 (owner: 10EBernhardson) [04:29:37] 06Operations, 10Deployment-Systems, 13Patch-For-Review, 03Scap3: Warning: rename(): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 189 - https://phabricator.wikimedia.org/T136258#2328938 (10bd808) Temporary files left by l10nupdate failures can be cleaned up with: ``` sudo -u l10... [04:43:01] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 633 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6476307 keys - replication_delay is 633 [04:47:15] 06Operations, 06Discovery, 10Maps, 10Tilerator, 03Discovery-Maps-Sprint: water_polygons import is broken - https://phabricator.wikimedia.org/T112831#1647374 (10Yurik) @maxsem, is this still relevant? [04:49:51] (03PS1) 10Ori.livneh: grafana: disable automatic update checking and external snapshots [puppet] - 10https://gerrit.wikimedia.org/r/290868 [04:50:53] (03CR) 10jenkins-bot: [V: 04-1] grafana: disable automatic update checking and external snapshots [puppet] - 10https://gerrit.wikimedia.org/r/290868 (owner: 10Ori.livneh) [04:51:16] (03PS2) 10Ori.livneh: grafana: disable automatic update checking and external snapshots [puppet] - 10https://gerrit.wikimedia.org/r/290868 [04:52:21] (03CR) 10jenkins-bot: [V: 04-1] grafana: disable automatic update checking and external snapshots [puppet] - 10https://gerrit.wikimedia.org/r/290868 (owner: 10Ori.livneh) [04:52:31] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6471457 keys - replication_delay is 0 [04:56:40] (03PS3) 10Ori.livneh: grafana: disable automatic update checking and external snapshots [puppet] - 10https://gerrit.wikimedia.org/r/290868 [05:02:02] (03PS1) 10Ori.livneh: grafana: add wmf branding [puppet] - 10https://gerrit.wikimedia.org/r/290869 [05:02:26] (03CR) 10Ori.livneh: [C: 032] grafana: disable automatic update checking and external snapshots [puppet] - 
10https://gerrit.wikimedia.org/r/290868 (owner: 10Ori.livneh) [05:04:51] (03CR) 10Ori.livneh: [C: 032 V: 032] grafana: add wmf branding [puppet] - 10https://gerrit.wikimedia.org/r/290869 (owner: 10Ori.livneh) [05:32:50] PROBLEM - puppet last run on planet2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:33:01] PROBLEM - configured eth on planet2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:33:20] PROBLEM - dhclient process on planet2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:33:21] PROBLEM - Disk space on planet2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:33:40] PROBLEM - Check size of conntrack table on planet2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:33:40] PROBLEM - salt-minion processes on planet2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:34:00] PROBLEM - DPKG on planet2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:34:10] nah, it's ok [05:34:20] <_joe_> yes, I was checking too [05:34:32] RECOVERY - puppet last run on planet2001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [05:34:42] <_joe_> load average at 20, but going down [05:34:51] RECOVERY - configured eth on planet2001 is OK: OK - interfaces up [05:34:57] <_joe_> the ganeti bug, for sure [05:35:01] RECOVERY - dhclient process on planet2001 is OK: PROCS OK: 0 processes with command name dhclient [05:35:11] RECOVERY - Disk space on planet2001 is OK: DISK OK [05:35:21] RECOVERY - Check size of conntrack table on planet2001 is OK: OK: nf_conntrack is 0 % full [05:35:21] RECOVERY - salt-minion processes on planet2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [05:35:38] yea, it doesn't get traffic and the crons are deactivated. 
this is just there for failover from planet1001 [05:35:41] RECOVERY - DPKG on planet2001 is OK: All packages OK [05:35:49] and ack @ ganeti [05:41:58] _joe_: for anytime later.. not urgent at all.. i wonder if there is a reason _not_ to install the font packages we install on imagescalers just on all appservers. https://gerrit.wikimedia.org/r/#/c/231284/ and https://phabricator.wikimedia.org/T84777#2328857 [05:42:57] <_joe_> mutante: I don't see a reason not to, but had no time to think about it [05:43:02] <_joe_> I have seen your patch though [05:43:28] alright! thanks [05:50:14] <_joe_> !log starting upgrades of hhvm to newer libicu in codfw (T86096) [05:50:15] T86096: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096 [05:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:52:52] (03CR) 10Dzahn: "this should be fine but in networks.pp $mw_appserver_networks = ['208.80.152.0/22'] does not cover labtestweb2001.wikimedia.org has addres" [puppet] - 10https://gerrit.wikimedia.org/r/290348 (owner: 10Alex Monk) [05:59:37] (03PS1) 10Dzahn: udp2log: move icinga checks from ./files/ to module [puppet] - 10https://gerrit.wikimedia.org/r/290871 [06:05:25] (03PS1) 10Dzahn: udp2log: mv rolematcher.py PacketLossLogtailer.py to module [puppet] - 10https://gerrit.wikimedia.org/r/290872 [06:06:08] (03CR) 10Mobrovac: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/290786 (owner: 10Mobrovac) [06:11:21] (03PS1) 10Dzahn: move/copy ubuntu-cloud.key into openstack/swift modules [puppet] - 10https://gerrit.wikimedia.org/r/290874 [06:16:29] (03PS1) 10Dzahn: varnish: mv wikimedia_vcl, netmapper_upd to separate files [puppet] - 10https://gerrit.wikimedia.org/r/290875 [06:17:38] (03PS2) 10Dzahn: varnish: mv wikimedia_vcl, netmapper_upd to separate files [puppet] - 10https://gerrit.wikimedia.org/r/290875 [06:20:45] (03CR) 10jenkins-bot: [V: 04-1] udp2log: mv rolematcher.py PacketLossLogtailer.py to 
module [puppet] - 10https://gerrit.wikimedia.org/r/290872 (owner: 10Dzahn) [06:20:53] (03PS1) 10Dzahn: varnish: move errorpage.html from misc to module [puppet] - 10https://gerrit.wikimedia.org/r/290876 [06:21:54] (03CR) 10Dzahn: "gotta love the PEP8 fail when you are just moving .py files from one place to another :p" [puppet] - 10https://gerrit.wikimedia.org/r/290872 (owner: 10Dzahn) [06:24:52] (03PS1) 10Dzahn: nagios: move check_command/config to own file [puppet] - 10https://gerrit.wikimedia.org/r/290877 [06:26:47] (03CR) 10Ori.livneh: [C: 031] "small cosmetic issue, lgtm otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290455 (owner: 10Filippo Giunchedi) [06:26:58] (03PS2) 10Dzahn: RESTBase: Remove purging config [puppet] - 10https://gerrit.wikimedia.org/r/290786 (owner: 10Mobrovac) [06:30:41] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:41] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:20] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:30] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:01] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:50] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:38:26] (03CR) 10jenkins-bot: [V: 04-1] varnish: move errorpage.html from misc to module [puppet] - 10https://gerrit.wikimedia.org/r/290876 (owner: 10Dzahn) [06:38:42] (03CR) 10Ori.livneh: [C: 031] "Looks good; didn't test. 
Few comments inline" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [06:39:01] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:41:32] <_joe_> !log upgrading hhvm on the eqiad canaries, T86096 [06:41:34] T86096: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096 [06:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:43:03] (03CR) 10jenkins-bot: [V: 04-1] RESTBase: Remove purging config [puppet] - 10https://gerrit.wikimedia.org/r/290786 (owner: 10Mobrovac) [06:43:13] heh [06:43:50] does someone know why wikidata went back to wmf.2? [06:50:32] Nikerabbit: https://dpaste.de/GRsq/raw [06:50:56] ori: kthanks [06:51:31] (03PS3) 10Muehlenhoff: Only require firejail on trusty/jessie [puppet] - 10https://gerrit.wikimedia.org/r/290723 [06:53:03] (03CR) 10Muehlenhoff: [C: 032 V: 032] Only require firejail on trusty/jessie [puppet] - 10https://gerrit.wikimedia.org/r/290723 (owner: 10Muehlenhoff) [06:55:51] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:56:01] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:56:11] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:57:00] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:57:30] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:32] (03PS2) 10Ori.livneh: wmflib: allow require_package('g++') [puppet] - 10https://gerrit.wikimedia.org/r/290697 (owner: 10Hashar) [06:57:47] (03CR) 10Ori.livneh: [C: 032 V: 032] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/290697 (owner: 10Hashar) [06:58:01] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:02:39] PROBLEM - HHVM rendering on mw2061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:04:07] RECOVERY - HHVM rendering on mw2061 is OK: HTTP OK: HTTP/1.1 200 OK - 67709 bytes in 0.270 second response time [07:20:08] PROBLEM - puppet last run on mw2087 is CRITICAL: CRITICAL: Puppet has 1 failures [07:29:00] <_joe_> !log upgrading hhvm on the eqiad imagescalers, T86096 [07:29:01] T86096: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096 [07:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:31:06] !log elastic: updating cirrussearch warmers on eqiad and codfw [07:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:36:48] <_joe_> !log upgrading hhvm on eqiad jobrunners, tin + terbium (T86096) [07:36:49] T86096: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096 [07:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:37:17] (03PS2) 10Muehlenhoff: Add ferm rules for role::snapshot::dumper [puppet] - 10https://gerrit.wikimedia.org/r/290421 [07:38:39] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for role::snapshot::dumper [puppet] - 10https://gerrit.wikimedia.org/r/290421 (owner: 10Muehlenhoff) [07:40:21] (03PS3) 10Muehlenhoff: Enable base::firewall for new snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/290422 [07:42:43] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall for new snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/290422 (owner: 10Muehlenhoff) [07:43:36] !log restbase starting partial mobile-sections dump of enwiki for T135571 on restbase1009 [07:43:37] T135571: [BUG] [Content Service] Tapping random 
causes an unknown error sometimes - https://phabricator.wikimedia.org/T135571 [07:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:45:11] (03CR) 10Alexandros Kosiaris: [C: 032] Change Prop: Purge RESTBase re-renders [puppet] - 10https://gerrit.wikimedia.org/r/290748 (owner: 10Mobrovac) [07:45:17] (03PS3) 10Alexandros Kosiaris: Change Prop: Purge RESTBase re-renders [puppet] - 10https://gerrit.wikimedia.org/r/290748 (owner: 10Mobrovac) [07:45:22] (03CR) 10Alexandros Kosiaris: [V: 032] Change Prop: Purge RESTBase re-renders [puppet] - 10https://gerrit.wikimedia.org/r/290748 (owner: 10Mobrovac) [07:46:37] RECOVERY - puppet last run on mw2087 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [07:48:30] akosiaris: ran puppet on scb or should i? [07:48:46] (cp shouldn't be restarted, i'll do it) [07:49:34] I ran puppet though [07:49:37] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: puppet fail [07:50:09] <_joe_> !log upgrading hhvm on eqiad's api cluster, (T86096) [07:50:10] T86096: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096 [07:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:51:14] kk, restarting [07:58:27] PROBLEM - puppet last run on mw1139 is CRITICAL: CRITICAL: Puppet has 1 failures [07:59:05] akosiaris: hm, wait, something's wrong [07:59:08] * mobrovac investigating [08:00:14] akosiaris: nm, false alarm [08:00:17] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Puppet has 1 failures [08:01:10] (03PS1) 10Muehlenhoff: Drop deployment-ssh rules from role::snapshot::dumper [puppet] - 10https://gerrit.wikimedia.org/r/290882 [08:02:30] akosiaris: ok, let's wait for 10 mins or so so that things settle down a bit and then continue with the rb part? 
[08:02:59] (03CR) 10Muehlenhoff: [C: 032 V: 032] Drop deployment-ssh rules from role::snapshot::dumper [puppet] - 10https://gerrit.wikimedia.org/r/290882 (owner: 10Muehlenhoff) [08:05:27] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [08:06:07] moritzm: thank you for the firejail / gallium fix up yesterday :) [08:07:47] hashar: yw, I wasn't sure whether we run any CI tests on videoscaler-specific tasks? because if we do I'll also need to build firejail for precise since I'm in the process of moving that to use it [08:13:26] moritzm: for the MediaWiki PHPUnit tests we need a wide range of .deb packages [08:13:39] and the easiest way I found to ship those .deb on the CI box has been to include mediawiki::packages [08:13:48] which in turns includes a lot of different classes and packages [08:13:53] then [08:14:09] gallium is no more running such tests, the related puppet class needs a lot of cleanup [08:14:25] we still have Precise box for the old release, and I am not sure whether firejail would be needed there or not [08:15:33] <_joe_> !log upgrading hhvm on eqiad's appserver cluster, (T86096) [08:15:34] T86096: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096 [08:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:15:52] hashar: ok, if it turns out to be needed, I can still build it, just didn't want to waste too much time on a deprecated OS [08:16:16] moritzm: is firejail something similar to app armor profile? Ie you would run : firejail somenasty.sh ? 
[08:16:37] I would assume some $wg variable would be set to enable firejail [08:17:37] hashar: yes, $wgImageMagickConvertCommand will be set to a wrapper which invokes firejail and the actual imagemagick convert [08:19:13] I guess that is going to be done in operations/mediawiki-config/ which we do not use for tests [08:19:37] so the CI jobs would be stuck with whatever is defined in MediaWiki includes/DefaultSettings.php which is: includes/DefaultSettings.php:$wgImageMagickConvertCommand = '/usr/bin/convert'; [08:19:40] ie no firejail [08:20:02] eventually one day we might want to have some integration tests that run tests relying on imagemagick with a firejail profile [08:20:15] maybe that can be added straight into mediawiki/core ;) [08:20:48] hashar: ok, great! [08:22:49] (03PS1) 10DCausse: Elastic: add support for network.host [puppet] - 10https://gerrit.wikimedia.org/r/290883 [08:23:50] (03CR) 10jenkins-bot: [V: 04-1] Elastic: add support for network.host [puppet] - 10https://gerrit.wikimedia.org/r/290883 (owner: 10DCausse) [08:25:30] (03PS2) 10DCausse: Elastic: add support for network.host [puppet] - 10https://gerrit.wikimedia.org/r/290883 [08:25:58] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:26:16] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:29:28] (03CR) 10Gehel: [C: 032] Elastic: add support for network.host [puppet] - 10https://gerrit.wikimedia.org/r/290883 (owner: 10DCausse) [08:30:47] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [08:35:59] (03PS1) 10Alexandros Kosiaris: wikistats: Fix pplint error in wikistats::db [puppet] - 10https://gerrit.wikimedia.org/r/290885 [08:36:37] !log installing openssh security updates on trusty systems [08:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:38:10] (03CR) 10Alexandros Kosiaris: [C: 
032] wikistats: Fix pplint error in wikistats::db [puppet] - 10https://gerrit.wikimedia.org/r/290885 (owner: 10Alexandros Kosiaris) [08:39:20] (03PS3) 10Alexandros Kosiaris: RESTBase: Remove purging config [puppet] - 10https://gerrit.wikimedia.org/r/290786 (owner: 10Mobrovac) [08:40:23] (03CR) 10Alexandros Kosiaris: [C: 032] RESTBase: Remove purging config [puppet] - 10https://gerrit.wikimedia.org/r/290786 (owner: 10Mobrovac) [08:43:47] (03PS1) 10DCausse: Cirrus: disable the safeifier in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290886 [08:45:27] !log powercycling snapshot1004 (stuck after reboot) [08:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:46:51] (03CR) 10Gehel: [C: 032] Cirrus: disable the safeifier in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290886 (owner: 10DCausse) [08:51:02] 06Operations, 10ops-eqiad, 06DC-Ops: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2329295 (10elukey) I was able to partition the new disk with ext4, but it has appeared under /dev/sda rather than /dev/sdd. Quick recap about the analytics config from https... 
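On the firejail wrapper moritzm describes at 08:17:37 ($wgImageMagickConvertCommand pointing at a wrapper that invokes firejail and then the real convert): a rough sketch, in Python, of the argument list such a wrapper could end up executing. The firejail install path and the profile path are assumptions made for illustration, and the actual wrapper is not shown anywhere in this log; only /usr/bin/convert (the DefaultSettings.php default quoted at 08:19:37) and firejail's --profile option are taken as given.

```python
import shlex

FIREJAIL = "/usr/bin/firejail"               # assumed install path
CONVERT = "/usr/bin/convert"                 # default quoted from DefaultSettings.php
PROFILE = "/etc/firejail/convert.profile"    # hypothetical profile file


def wrapped_convert_argv(convert_args, profile=PROFILE):
    """Return the argv a wrapper could exec: firejail first, then convert and its arguments."""
    return [FIREJAIL, f"--profile={profile}", CONVERT, *convert_args]


def as_shell_line(argv):
    """Render an argv list as a safely quoted shell command line."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    print(as_shell_line(wrapped_convert_argv(["in.png", "-resize", "200x200", "out.jpg"])))
```

As hashar notes, CI would not pick this up as long as the setting lives in operations/mediawiki-config; with the DefaultSettings.php value the command is plain /usr/bin/convert with no sandbox in front of it.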
[08:53:04] !log dcausse@tin Synchronized wmf-config/CirrusSearch-labs.php: Cirrus: disable the safeifier in labs (duration: 02m 36s) [08:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:54:34] PROBLEM - puppet last run on mw2130 is CRITICAL: CRITICAL: Puppet has 1 failures [08:55:34] !log restbase deployment start of bd38b1b [08:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:56:30] <_joe_> oh you're deploying restbase [08:56:40] yes [08:56:41] <_joe_> I was tailing pybal logs and saw rb hosts failing [08:56:42] <_joe_> :P [08:56:43] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [08:56:44] PROBLEM - puppet last run on mw2107 is CRITICAL: CRITICAL: Puppet has 1 failures [08:56:48] hehe [08:58:06] (03PS1) 10Hashar: rsync: allow extra settings in rsyncd.conf [puppet] - 10https://gerrit.wikimedia.org/r/290895 (https://phabricator.wikimedia.org/T136276) [08:58:08] !log dcausse@tin Synchronized wmf-config/CirrusSearch-labs.php: Cirrus: disable the safeifier in labs (duration: 00m 25s) [08:58:08] (03PS1) 10Hashar: contint: disable DNS lookup for castor rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/290896 (https://phabricator.wikimedia.org/T136276) [08:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:58:24] (03CR) 10Hashar: [C: 04-1] "untested" [puppet] - 10https://gerrit.wikimedia.org/r/290895 (https://phabricator.wikimedia.org/T136276) (owner: 10Hashar) [08:58:30] (03CR) 10Hashar: [C: 04-1] "untested" [puppet] - 10https://gerrit.wikimedia.org/r/290896 (https://phabricator.wikimedia.org/T136276) (owner: 10Hashar) [08:59:15] (03PS5) 10Filippo Giunchedi: graphite: split uwsgi logs to separate files [puppet] - 10https://gerrit.wikimedia.org/r/290455 [09:00:36] (03PS6) 10Filippo Giunchedi: graphite: split uwsgi logs to separate files [puppet] - 
10https://gerrit.wikimedia.org/r/290455 [09:00:43] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: split uwsgi logs to separate files [puppet] - 10https://gerrit.wikimedia.org/r/290455 (owner: 10Filippo Giunchedi) [09:05:40] !log restbase deployment end of bd38b1b [09:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:07:20] !log converting user table on labswiki to utf8 [09:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:07:48] (03PS1) 10Filippo Giunchedi: uwsgi: use @basename not @title in syslog [puppet] - 10https://gerrit.wikimedia.org/r/290899 [09:09:50] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] uwsgi: use @basename not @title in syslog [puppet] - 10https://gerrit.wikimedia.org/r/290899 (owner: 10Filippo Giunchedi) [09:12:32] 06Operations, 10ops-eqiad, 06DC-Ops: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2329330 (10elukey) Yes single disk raid0 virtual drive seems to be the way: ``` elukey@analytics1047:~$ sudo megacli -LDInfo -L2 -a0 Adapter 0 -- Virtual Drive Information:... [09:13:28] (03PS5) 10Filippo Giunchedi: service::node: Prepare for scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/290490 (owner: 10Mobrovac) [09:13:36] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] service::node: Prepare for scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/290490 (owner: 10Mobrovac) [09:15:15] mobrovac: ^ [09:15:23] tgr, can you check horizon log? [09:15:30] *logging [09:15:32] thnx godog! [09:15:42] np [09:15:47] godog: i'll run puppet on scb [09:17:00] (03CR) 10jenkins-bot: [V: 04-1] contint: disable DNS lookup for castor rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/290896 (https://phabricator.wikimedia.org/T136276) (owner: 10Hashar) [09:17:14] jynus: I still get "An error occurred authenticating. Please try again later. [09:18:15] can I copy that^ to the ticket? 
[09:18:34] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: Puppet has 1 failures [09:19:35] PROBLEM - DPKG on mw2152 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:20:34] RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:21:13] RECOVERY - puppet last run on mw2130 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:21:24] RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [09:21:34] RECOVERY - DPKG on mw2152 is OK: All packages OK [09:25:23] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Puppet has 4 failures [09:28:11] <_joe_> !log all traffic serving appservers are now running with libicu52 (T86096) [09:28:12] T86096: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096 [09:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:33:00] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2329384 (10fgiunchedi) pillow 3.2.0-2~bpo8+1 uploaded to jessie-backports, should appear in the next few days [09:34:47] (03CR) 10Filippo Giunchedi: [C: 04-1] "superceded by I7be2777 I think" [puppet] - 10https://gerrit.wikimedia.org/r/290797 (https://phabricator.wikimedia.org/T136206) (owner: 10Dzahn) [09:39:53] (03PS3) 10Filippo Giunchedi: assign 'c' IPs for restbase100[7-9] [puppet] - 10https://gerrit.wikimedia.org/r/290797 (https://phabricator.wikimedia.org/T136206) (owner: 10Dzahn) [09:42:10] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "nevermind, looks like this came first, amended to comment the hosts, now supercedes I7be2777f" [puppet] - 10https://gerrit.wikimedia.org/r/290797 (https://phabricator.wikimedia.org/T136206) (owner: 10Dzahn) [09:42:28] (03PS4) 10Filippo Giunchedi: assign 'c' IPs for 
restbase100[7-9] [puppet] - 10https://gerrit.wikimedia.org/r/290797 (https://phabricator.wikimedia.org/T136206) (owner: 10Dzahn) [09:42:33] (03CR) 10Filippo Giunchedi: [V: 032] assign 'c' IPs for restbase100[7-9] [puppet] - 10https://gerrit.wikimedia.org/r/290797 (https://phabricator.wikimedia.org/T136206) (owner: 10Dzahn) [09:43:26] (03Abandoned) 10Filippo Giunchedi: stub out missing 'c' instances [puppet] - 10https://gerrit.wikimedia.org/r/290800 (https://phabricator.wikimedia.org/T136206) (owner: 10Eevans) [09:45:12] PROBLEM - check_mysql on fdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2420 [09:49:53] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [09:50:12] RECOVERY - check_mysql on fdb2001 is OK: Uptime: 1890889 Threads: 2 Questions: 34677885 Slow queries: 11238 Opens: 1151 Flush tables: 2 Open tables: 577 Queries per second avg: 18.339 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 300 [09:54:59] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, other cassandra clusters might be interested in the change too?" [puppet] - 10https://gerrit.wikimedia.org/r/290860 (owner: 10Eevans) [09:58:02] (03Abandoned) 10Jcrespo: Remove dns entries for es2005-es2010 [dns] - 10https://gerrit.wikimedia.org/r/287645 (https://phabricator.wikimedia.org/T134755) (owner: 10Jcrespo) [09:58:43] <_joe_> !log starting updateCollations.php forced run on all wikis with uca category collation [09:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:01:15] 06Operations, 10ops-codfw: ms-be2012.codfw.wmnet: slot=10 dev=sdk failed - https://phabricator.wikimedia.org/T135975#2329435 (10MoritzMuehlenhoff) a:03Papaul [10:09:14] PROBLEM - Host pay-lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [10:09:19] PROBLEM - Host payments2002 is DOWN: PING CRITICAL - Packet loss = 100% [10:09:52] <_joe_> srx again? 
[10:09:59] looking [10:10:44] I checked no user impact [10:10:46] it's responding to pings but I 've not a shell yet [10:11:26] PROBLEM - Host alnitak is DOWN: PING CRITICAL - Packet loss = 100% [10:11:31] PROBLEM - Host betelgeuse is DOWN: PING CRITICAL - Packet loss = 100% [10:11:42] yeah, it's almost definitely the pfw [10:12:16] goddammit [10:12:20] right when faidon was ranting to me about juniper [10:13:09] -rw-rw---- 1 root wheel 0 May 26 10:06 /var/tmp/flowd_octeon_hm.core.0.gz [10:13:14] 0-byte coredump again [10:13:19] yay [10:13:36] (03PS1) 10Muehlenhoff: Provide a wrapper to invoke convert using firejail [puppet] - 10https://gerrit.wikimedia.org/r/290909 [10:14:52] plenty of available space too [10:14:58] I ran a storage cleanup last time around [10:15:18] RECOVERY - Host pay-lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 34.07 ms [10:15:23] RECOVERY - Host alnitak is UP: PING OK - Packet loss = 0%, RTA = 34.11 ms [10:15:28] RECOVERY - Host betelgeuse is UP: PING OK - Packet loss = 0%, RTA = 33.41 ms [10:15:35] RECOVERY - Host payments2002 is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [10:16:13] (03Abandoned) 10Muehlenhoff: WIP: Use firejail in image scaling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288390 (https://phabricator.wikimedia.org/T135111) (owner: 10Muehlenhoff) [10:16:37] I can open a case to ask juniper what the fuck is with 0-byte coredumps [10:16:46] JSRPD_HA_CONTROL_LINK_DOWN: HA control link monitor status is marked down [10:17:33] 10:05:49 ^ [10:29:46] (03CR) 10jenkins-bot: [V: 04-1] Provide a wrapper to invoke convert using firejail [puppet] - 10https://gerrit.wikimedia.org/r/290909 (owner: 10Muehlenhoff) [10:37:55] RECOVERY - cassandra-c CQL 10.64.32.204:9042 on restbase1012 is OK: TCP OK - 0.001 second response time on port 9042 [10:38:30] 06Operations, 07Puppet, 06Labs: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#2329486 (10hashar) [10:38:49] 06Operations, 07Puppet, 
10Beta-Cluster-Infrastructure: Hiera hierarchy hieradata/role/* is not applied on labs (eg deployment-prep) - https://phabricator.wikimedia.org/T136080#2322414 (10hashar) [10:38:50] 06Operations, 07Puppet, 06Labs: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#1847021 (10hashar) [10:39:12] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure: Hiera hierarchy hieradata/role/* is not applied on labs (eg deployment-prep) - https://phabricator.wikimedia.org/T136080#2322414 (10hashar) Thanks @scfc marked this as a duplicate of T120165. I have copy pasted my extended task description there. [10:39:44] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#1847021 (10hashar) [10:44:24] (03PS3) 10Muehlenhoff: Provide a firejail profile for the image scalers [puppet] - 10https://gerrit.wikimedia.org/r/290696 (https://phabricator.wikimedia.org/T135111) [10:44:51] (03PS7) 10Filippo Giunchedi: prometheus: add server support [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) [10:44:52] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#2329534 (10Joe) The role keyword is used in production to refer to large groups of hosts; we DEFINITELY don't want to have role lookups in labs for the same... 
[10:44:58] (03CR) 10Filippo Giunchedi: prometheus: add server support (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [10:45:13] (03CR) 10jenkins-bot: [V: 04-1] prometheus: add server support [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [10:46:02] (03PS2) 10Muehlenhoff: Provide a wrapper to invoke convert using firejail [puppet] - 10https://gerrit.wikimedia.org/r/290909 [10:47:58] (03PS8) 10Filippo Giunchedi: prometheus: add server support [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) [10:49:56] (03CR) 10Filippo Giunchedi: [C: 031] Monitoring: Install vendor specific RAID tool [puppet] - 10https://gerrit.wikimedia.org/r/290717 (https://phabricator.wikimedia.org/T97998) (owner: 10Volans) [10:52:45] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 669 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6494536 keys - replication_delay is 669 [10:53:29] 06Operations, 10RESTBase-Cassandra, 13Patch-For-Review: better cassandra process checks - https://phabricator.wikimedia.org/T108306#2329544 (10fgiunchedi) [10:53:31] 06Operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#2329546 (10fgiunchedi) [10:53:35] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2329542 (10fgiunchedi) 05Open>03Resolved I agree this is complete, let's followup on T134016, resolving! 
[10:56:46] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6482710 keys - replication_delay is 0 [11:00:05] 06Operations, 07HHVM, 13Patch-For-Review, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2329554 (10Joe) Upgrade is done and scripts are running. Sadly, while some are exceeding my conservative evaluation of performance, frwiki is running around 1... [11:02:21] (03CR) 10jenkins-bot: [V: 04-1] Provide a wrapper to invoke convert using firejail [puppet] - 10https://gerrit.wikimedia.org/r/290909 (owner: 10Muehlenhoff) [11:09:24] 06Operations, 10ops-eqiad, 06DC-Ops: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2329564 (10elukey) Fixed the issue with: ``` sudo megacli -PDMakeGood -PhysDrv '[32:2]' -Force -a0 sudo megacli -CfgLdAdd -r0 [32:2] -a0 ``` After the reboot the /dev/sdd di... [11:15:31] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2329584 (10Gilles) [11:15:49] (03CR) 10Alexandros Kosiaris: [C: 031] "+1" [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [11:20:14] PROBLEM - check_mysql on fdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2067 [11:21:59] 06Operations, 10cassandra: Grafana bugginess; Graph scales sometimes off by an order of magnitude - https://phabricator.wikimedia.org/T121789#2329590 (10fgiunchedi) I think it comes from statsd recommendation on how to aggregate graphite metrics (https://github.com/etsy/statsd/blob/master/docs/graphite.md). Re... 
[11:24:36] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2329593 (10Gilles) The only remaining dependency, the upstream update of scikit-image is proving difficult. The package is massive, its packaging is complicated. It has... [11:25:14] RECOVERY - check_mysql on fdb2001 is OK: Uptime: 1896589 Threads: 1 Questions: 34732919 Slow queries: 11404 Opens: 1153 Flush tables: 2 Open tables: 577 Queries per second avg: 18.313 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [11:27:04] 06Operations, 10cassandra: change graphite aggregation function for cassandra 'count' metrics - https://phabricator.wikimedia.org/T121789#2329595 (10fgiunchedi) [11:53:44] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 714 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6488123 keys - replication_delay is 714 [11:55:41] !log rebooting mx2001 for kernel update to Linux 4.4 [11:55:46] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6481588 keys - replication_delay is 0 [11:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:07:10] !log rolling reboot of restbase-test cluster [12:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:13:18] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2329671 (10Gilles) @fgiunchedi pointed out that pyssim has no license: https://github.com/jterrace/pyssim/issues/14 I've tracked down the original author of the code t... 
[12:24:44] (03PS1) 10Mobrovac: RESTBase: use the appropriate logger name [puppet] - 10https://gerrit.wikimedia.org/r/290922 (https://phabricator.wikimedia.org/T103124) [12:25:26] PROBLEM - Host es2017 is DOWN: PING CRITICAL - Packet loss = 100% [12:26:10] (03CR) 10Ppchelko: [C: 031] RESTBase: use the appropriate logger name [puppet] - 10https://gerrit.wikimedia.org/r/290922 (https://phabricator.wikimedia.org/T103124) (owner: 10Mobrovac) [12:26:56] (03CR) 10Mobrovac: "PCC confirms that's it - https://puppet-compiler.wmflabs.org/2931/ :)" [puppet] - 10https://gerrit.wikimedia.org/r/290922 (https://phabricator.wikimedia.org/T103124) (owner: 10Mobrovac) [12:29:06] (03CR) 10Alexandros Kosiaris: [C: 032] RESTBase: use the appropriate logger name [puppet] - 10https://gerrit.wikimedia.org/r/290922 (https://phabricator.wikimedia.org/T103124) (owner: 10Mobrovac) [12:31:33] !log updating user table on labswiki to fix incorrect encoding T131630 [12:31:34] T131630: Tgr unable to login on Horizon - https://phabricator.wikimedia.org/T131630 [12:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:33:24] did es2017 crashed / lost network? 
[12:36:27] it seems on serial console like a kernel crash [12:39:47] !log powercycling es2017 [12:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:42:12] RECOVERY - Host es2017 is UP: PING OK - Packet loss = 0%, RTA = 34.47 ms [12:47:36] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2329742 (10elukey) [12:48:36] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2253604 (10elukey) a:05Cmjohnson>03elukey [13:05:10] 06Operations, 10RESTBase-Cassandra, 06Services, 10cassandra: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2329809 (10fgiunchedi) 05Open>03Resolved resolving as cassandra metrics are cleaned up now [13:10:43] 06Operations, 10DBA: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100#2329826 (10jcrespo) [13:12:31] 06Operations, 10Traffic, 13Patch-For-Review, 07Performance: Lots of Title::purgeExpiredRestriction from API DELETE FROM `page_restrictions` WHERE (pr_expiry < '20160517063108') without batching/throttling potentially causing lag on s5-api - https://phabricator.wikimedia.org/T135470#2329838 (10jcrespo) I wo... 
[13:12:49] 06Operations, 10DBA: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100#2329840 (10jcrespo) 05Open>03Resolved [13:26:41] 06Operations, 10Monitoring, 10netops, 03Scap3 (Scap3-Adoption-Phase1): Deploy libreNMS with scap3 - https://phabricator.wikimedia.org/T129136#2329902 (10faidon) a:03akosiaris [13:30:54] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2329937 (10elukey) Added some documentation in: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#Swapping_broken_disk [13:35:13] 06Operations, 10ops-codfw, 10DBA: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2329955 (10jcrespo) p:05Normal>03High es2017 just (crashed?) at 12:25 today, I do not think that is unrelated. [13:35:24] 06Operations, 10ops-codfw, 10DBA: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2329958 (10jcrespo) 05stalled>03Open [13:37:20] bd808, hey you know mwscriptwikiset? [13:44:25] (03PS1) 10Elukey: Allow float result for int/int division in gmond's memcached module. [puppet] - 10https://gerrit.wikimedia.org/r/290933 [13:50:02] 06Operations, 10ops-codfw, 10DBA: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2330063 (10jcrespo) Nothing on syslog: ``` May 26 12:05:01 es2017 CRON[135137]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) May 26 12:15:01 es2017 CRON[135912]: (root) CMD (command... 
[13:52:22] !log restarting es2017 for kernel upgrade [13:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:55:05] (03PS1) 10Ottomata: Update otto's iterm2 shell integration script [puppet] - 10https://gerrit.wikimedia.org/r/290934 [13:56:40] (03CR) 10Ottomata: [C: 032] Update otto's iterm2 shell integration script [puppet] - 10https://gerrit.wikimedia.org/r/290934 (owner: 10Ottomata) [14:04:24] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2330100 (10jcrespo) [14:05:16] 06Operations: Apt mirror for Ubuntu Trusty hash sum mismatch - https://phabricator.wikimedia.org/T136307#2330101 (10hashar) [14:05:31] (03PS1) 10Alexandros Kosiaris: rsyslog::receiver: Increase log retention to 90 days [puppet] - 10https://gerrit.wikimedia.org/r/290935 [14:09:32] 06Operations: Apt mirror for Ubuntu Trusty hash sum mismatch - https://phabricator.wikimedia.org/T136307#2330125 (10hashar) For what it is worth the mirror status page from 19 hours ago shows that Trusty is lagging behind https://launchpad.net/ubuntu/+mirror/wikimedia-archive {F4057150 size=full} [14:12:36] for fun and just in case someone here deals with typo-squatting, check out the random attacks at store.wikipeda.org [14:13:11] (asks for location, to install add-ons, etc. it's like a kitchen sink of simple hacks) [14:14:30] domains are handled by legal [14:14:35] 06Operations, 10media-storage, 13Patch-For-Review: swift capacity planning - https://phabricator.wikimedia.org/T1268#2330165 (10fgiunchedi) another factor for capacity swift capacity planning purposes is space allocated for different container types, most importantly thumbs and originals (69T vs 89T) [[ htt... 
[14:16:21] (03CR) 10Filippo Giunchedi: [C: 031] rsyslog::receiver: Increase log retention to 90 days [puppet] - 10https://gerrit.wikimedia.org/r/290935 (owner: 10Alexandros Kosiaris) [14:21:38] 06Operations, 06Discovery, 10Maps: Configure monitoring / alerting of Postgresql / redis / ... cluster for maps - https://phabricator.wikimedia.org/T135647#2330195 (10Gehel) [14:21:40] (03PS1) 10Hashar: contint: let us vary localhost vhost unix user [puppet] - 10https://gerrit.wikimedia.org/r/290938 (https://phabricator.wikimedia.org/T136301) [14:22:09] 06Operations, 06Discovery, 10Maps: Configure monitoring / alerting of Postgresql / redis / ... cluster for maps - https://phabricator.wikimedia.org/T135647#2305596 (10Gehel) Karthoterian check could be an HTTP check on https://maps.wikimedia.org/osm-intl/0/0/0.png (or the equivalent on localhost) [14:23:06] (03CR) 10Alex Monk: "possibly, but not for scap itself AFAIK:" [puppet] - 10https://gerrit.wikimedia.org/r/290348 (owner: 10Alex Monk) [14:24:54] (03PS1) 10Ottomata: Add druid100[123] with just standard and base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/290940 (https://phabricator.wikimedia.org/T134275) [14:26:27] (03CR) 10Ottomata: [C: 032] Add druid100[123] with just standard and base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/290940 (https://phabricator.wikimedia.org/T134275) (owner: 10Ottomata) [14:27:30] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2330213 (10jcrespo) I think I got it: ``` "Normal","Mon Feb 08 2016 16:06:18","Log cleared." "Critical","Thu May 26 2016 12:22:06","CPU 2 has an internal error (IERR)." "Normal","Thu May 26 2... [14:28:31] robh, yt? 
[14:30:47] Krenair: I haven't looked at or used mwscriptwikiset, no [14:30:58] it's like foreachwikiindblist [14:31:00] but different [14:31:37] they do very similar things [14:32:05] (03CR) 10Rush: contint: let us vary localhost vhost unix user (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/290938 (https://phabricator.wikimedia.org/T136301) (owner: 10Hashar) [14:32:10] (03PS2) 10Rush: contint: let us vary localhost vhost unix user [puppet] - 10https://gerrit.wikimedia.org/r/290938 (https://phabricator.wikimedia.org/T136301) (owner: 10Hashar) [14:32:18] Looking now. Very very similar [14:32:34] chasemp: neat :) [14:32:44] Krenair: should we figure out how to combine them? [14:33:00] The difference seems to be the output prefixing mostly [14:34:13] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup/deploy 3 eqiad druid nodes - https://phabricator.wikimedia.org/T134275#2330218 (10Ottomata) > what does 'update install_server module' mean? Oh duh, it means stick in some MAC addies and some partman. Ok, I'm working on this. @Cmjohnson druid1003 do... 
[14:34:40] I think I wrote a task for it months ago [14:34:54] https://phabricator.wikimedia.org/T109798 [14:37:12] 06Operations, 06Labs, 06Project-Admins: Archive old Incident-* projects - https://phabricator.wikimedia.org/T134624#2330224 (10Danny_B) [14:37:32] 06Operations, 10ops-eqiad: Wipe wmf4727 - https://phabricator.wikimedia.org/T136309#2330226 (10akosiaris) [14:37:43] 06Operations, 10ops-eqiad: Wipe wmf4727 - https://phabricator.wikimedia.org/T136309#2330238 (10akosiaris) p:05Triage>03High [14:39:28] 06Operations, 10ops-eqiad: Wipe wmf4727 - https://phabricator.wikimedia.org/T136309#2330242 (10faidon) [14:39:30] 06Operations, 10ops-eqiad, 06DC-Ops: testing: r430 server / h800 controller / md1200 shelf - https://phabricator.wikimedia.org/T127490#2330244 (10faidon) [14:40:40] 06Operations, 10ops-eqiad, 06DC-Ops: testing: r430 server / h800 controller / md1200 shelf - https://phabricator.wikimedia.org/T127490#2045136 (10faidon) Folks, having a wmfNNNN server set up like that for such a long time and not being cleaned up properly is a problem for security and general maintenance re... [14:40:42] I'm looking at a alerting script, check_graphite, from operations-puppet.git. It looks like there was a temporary version created locally on a server that's still in use, and the issue prematurely marked as resolved [14:41:10] issue link https://phabricator.wikimedia.org/T116035 temp file mentioned here https://gerrit.wikimedia.org/r/#/c/255415/ [14:41:11] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2330264 (10jcrespo) a:05jcrespo>03Papaul For es2017, CPU seems to have failed: ``` CPU 2 Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz E5 2600 MHz IERR 10 ``` Memory show currently as ok, bu... 
[14:41:39] 06Operations, 06Labs, 06Project-Admins: Archive old Incident-* projects - https://phabricator.wikimedia.org/T134624#2330268 (10Danny_B) [14:44:06] ilevy: you probably want to ping ottomata considering that changeset^ [14:44:08] (03CR) 10Hashar: contint: let us vary localhost vhost unix user (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/290938 (https://phabricator.wikimedia.org/T136301) (owner: 10Hashar) [14:44:21] (03PS3) 10Hashar: contint: let us vary localhost vhost unix user [puppet] - 10https://gerrit.wikimedia.org/r/290938 (https://phabricator.wikimedia.org/T136301) [14:44:45] 06Operations, 06Labs, 06Project-Admins: Archive old Incident-* projects - https://phabricator.wikimedia.org/T134624#2330287 (10Danny_B) [14:47:31] (03CR) 10Rush: [C: 032] contint: let us vary localhost vhost unix user [puppet] - 10https://gerrit.wikimedia.org/r/290938 (https://phabricator.wikimedia.org/T136301) (owner: 10Hashar) [14:47:34] oof not remembering, but I see at least in cache/kafka/webrequest.pp , the conditional no longer is present [14:47:41] (even though the comment is) [14:47:58] I reopened the issue [14:48:15] (03PS1) 10Ottomata: Add netboot MACs and partman recipe for druid hosts [puppet] - 10https://gerrit.wikimedia.org/r/290944 (https://phabricator.wikimedia.org/T134275) [14:48:20] modules/nagios_common/files/check_commands/check_graphite.cfg still refers to the local script [14:48:24] so I think it might still be used [14:48:33] since --until isn't in the checked in version but it's used [14:48:53] I noticed because my company is also using this script I wanted --until support and noticed you guys added it in git and then reverted [14:49:04] hm thanks ilevy yeah looks like that one fell through the cracks [14:49:31] 06Operations: encrypt syslog traffic - https://phabricator.wikimedia.org/T136312#2330299 (10fgiunchedi) [14:51:02] (03PS1) 10Eevans: enable instance restbase1014-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/290945 
(https://phabricator.wikimedia.org/T134016) [14:52:04] (03PS2) 10Ottomata: Add netboot MACs and partman recipe for druid hosts [puppet] - 10https://gerrit.wikimedia.org/r/290944 (https://phabricator.wikimedia.org/T134275) [14:52:36] Could I get someone to merge https://gerrit.wikimedia.org/r/#/c/290945/? It adds a new Cassandra instance for bootstrap. [14:52:47] 06Operations, 10vm-requests: eqiad/codfw: 1 VM request for prometheus - https://phabricator.wikimedia.org/T136313#2330317 (10fgiunchedi) [14:53:26] urandom: that host exists? [14:53:35] / this is safe for me to merge? [14:53:38] yup! [14:54:07] ottomata: those entries were all laid out ahead of time [14:54:26] (03CR) 10Ottomata: [C: 032] enable instance restbase1014-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/290945 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [14:54:28] uncommenting them just creates the config for that instance, so we can start the bootstrap [14:54:49] done. [14:54:53] ottomata: thank you! [14:54:58] (03PS3) 10Ottomata: Add netboot MACs and partman recipe for druid hosts [puppet] - 10https://gerrit.wikimedia.org/r/290944 (https://phabricator.wikimedia.org/T134275) [14:55:06] (03CR) 10Ottomata: [C: 032 V: 032] Add netboot MACs and partman recipe for druid hosts [puppet] - 10https://gerrit.wikimedia.org/r/290944 (https://phabricator.wikimedia.org/T134275) (owner: 10Ottomata) [14:57:19] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2330334 (10jcrespo) es2019 seems to had suffered the same cpu and memory errors: ``` MEM0001: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1. 2016-04-22T14:48:59... 
[14:58:48] 06Operations, 06Analytics-Kanban, 10Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2330347 (10elukey) p:05Triage>03High [14:59:04] 06Operations, 06Labs, 10Labs-Infrastructure: rcstream not working for wikitech wiki - https://phabricator.wikimedia.org/T136245#2330352 (10Krenair) I think we might need to change `@resolve(wikitech.wikimedia.org)` to `@resolve(wikitech.wikimedia.org, AAAA)` [15:00:04] anomie ostriches thcipriani marktraceur aude: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160526T1500). Please do the needful. [15:00:04] Pchelolo: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:02:08] 06Operations: Race condition in setting net.netfilter.nf_conntrack_tcp_timeout_time_wait - https://phabricator.wikimedia.org/T136094#2330361 (10MoritzMuehlenhoff) So the problem occurs whenever /etc/sysctl.d/70-ferm_conntrack.conf is processed before ferm has been started (which loads the nf_conntrack kernel mod... [15:02:09] I can SWAT today. Pchelolo ping me when you're around. [15:02:19] thcipriani: I'm here [15:02:30] okie doke [15:06:21] !log Bootstrapping restbase1014-c.eqiad.wmnet : T134016 [15:06:22] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [15:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:06:32] PROBLEM - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is CRITICAL: Connection refused [15:06:43] expected; got this ^^ [15:07:33] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is CRITICAL: Connection refused eevans Node is bootstrapping. - The acknowledgement expires at: 2016-05-27 15:07:13. 
[15:07:35] !log thcipriani@tin Synchronized php-1.28.0-wmf.3/extensions/EventBus: SWAT: [[gerrit:290906|Use getPrefixedURL and getPrefixedDBkey instead of getText]] (duration: 00m 35s) [15:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:07:43] ^ Pchelolo check please [15:08:00] ottomata: here now, sup? [15:08:25] thcipriani: one moment [15:12:12] !log Starting cleanup of restbase1012-a.eqiad.wmnet [15:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:40] robh: hiyaa, was going to ask about some install server stuff, i got through it though! [15:12:50] except, druid1003's mgmt doesn't seem to be responding [15:13:00] druid1001 and 1002 are, and i think they are installing os now :o [15:13:07] hm, or maybe not [15:13:21] ahhh actually they are not, they just keep saying [15:13:23] May 26 15:13:19 carbon dhcpd: DHCPDISCOVER from 14:02:ec:06:8b:ec via 10.64.36.3: network 10.64.36.0/24: no free leases [15:14:26] thcipriani: all's good, thank you [15:14:39] Pchelolo: cool. Thanks for checking :) [15:16:41] robh: hmm maybe I configured somehtign wrong [15:16:42] i see [15:16:49] no free leases means either the dns isnt right [15:16:51] DHCPDISCOVER from 1c:98:ec:29:e2:98 via 10.64.5.3: network 10.64.5.0/24: no free leases for druid1001 [15:16:51] or the vlan isnt right [15:16:54] and [15:16:56] in dns [15:16:58] should these be eqiad or wikimedia.org? [15:17:03] it is 10.64.0.163 [15:17:03] eqiad [15:17:10] wmnet [15:17:12] ok, thats the right network, has carbon gotten the update? [15:17:16] yes [15:17:50] puppets disabled on carbon [15:17:56] but checking to see if it has the update [15:18:09] hm, i ran pupppet before I tried booting and saw my commit applied [15:18:13] thcipriani: I'd like to push https://gerrit.wikimedia.org/r/#/c/290710/ out in the SWAT [15:18:15] but, robh, that is the right network? 
[15:18:24] checking stuff now [15:18:35] i have a checklist, i dont skip around it ;] [15:18:41] bd808: are you fine with https://gerrit.wikimedia.org/r/#/c/290867/1 going out with SWAT? [15:18:43] 10.64.0.163 is not in 10.64.5.0/24, is it? [15:18:48] Krinkle: okie doke [15:18:49] haha ok robh [15:19:08] which host is this specifically we are looking at? [15:19:12] thcipriani: yeah if it looks good to you do it :) [15:19:13] you mentioned like 3 and then some output [15:19:14] ;] [15:19:27] druid1001 ? [15:19:27] robh, both druid1001 and druid1002 [15:19:33] ok, lets stick with druid1001 for now [15:19:34] k [15:19:39] May 26 15:19:33 carbon dhcpd: DHCPDISCOVER from 1c:98:ec:2a:a1:50 via 10.64.32.3: network 10.64.32.0/22: no free leases [15:19:42] is druid1001 [15:19:44] looking [15:19:57] but in dns it has 10.64.0.163 [15:20:19] 1C:98:EC:29:E2:98 [15:20:24] so those macs dont match [15:20:35] druid1001 has a mac address in the lease file of 1C:98:EC:29:E2:98 [15:20:41] wait [15:20:43] !log Update cxserver to b431aef [15:20:43] and you just pasteed a wholly different mac address [15:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:00] i think i pasted you a wrong log message... 
[15:21:19] this is what i put in the linux-host-entries file [15:21:19] 1C:98:EC:29:E2:98 [15:21:23] right [15:21:30] sorry robh [15:21:34] wrong log entry [15:21:34] this one [15:21:35] May 26 15:19:46 carbon dhcpd: DHCPDISCOVER from 1c:98:ec:29:e2:98 via 10.64.5.2: network 10.64.5.0/24: no free leases [15:21:37] is druid1001 [15:22:30] ok, so they match on mac [15:22:32] checking dns [15:24:05] (03PS2) 10Thcipriani: CommonSettings: cleanup temp cache file if rename fails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290867 (https://phabricator.wikimedia.org/T136258) (owner: 10BryanDavis) [15:24:41] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290867 (https://phabricator.wikimedia.org/T136258) (owner: 10BryanDavis) [15:25:02] well, the dns shows its in private1-a-eqiad [15:25:07] but your switch config has it in analytics1-a-eqiad [15:25:11] which would explain this [15:25:14] ottomata: ^ [15:25:17] (03Merged) 10jenkins-bot: CommonSettings: cleanup temp cache file if rename fails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290867 (https://phabricator.wikimedia.org/T136258) (owner: 10BryanDavis) [15:25:32] so you are trying to hand out a IP lease over a subnet that isnt allowed to do so for another subnet [15:25:48] Krinkle: ^ going to push that while waiting for jenkins, FYI [15:25:50] robh hm, ok, i didn't do the dns [15:25:56] this should be in analytics vlan [15:26:01] thcipriani: Ok [15:26:03] so, do I just need to pick a diff dns? [15:26:08] ottomata: yep, well, that explains the answer. you need to redo your dns to move it into the right part of the file [15:26:12] and your production dns entries will change [15:26:13] ok [15:26:17] cool, doing... 
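The mismatch diagnosed above (a DNS address in one subnet, a DHCP request arriving on another) can be checked mechanically. This is a minimal sketch using Python's standard `ipaddress` module; the function name `lease_possible` is hypothetical, and the addresses are the ones quoted in the log. It only illustrates the subnet-membership test, not how dhcpd itself decides.

```python
import ipaddress


def lease_possible(host_ip: str, request_network: str) -> bool:
    """Return True if host_ip lies inside the subnet the DHCP request arrived on.

    Hypothetical helper: a fixed address outside that subnet is one way to end
    up with the "no free leases" message discussed in the log.
    """
    return ipaddress.ip_address(host_ip) in ipaddress.ip_network(request_network)


# druid1001's DNS entry versus the network named in the dhcpd log line.
print(lease_possible("10.64.0.163", "10.64.5.0/24"))  # False

# An address inside that network passes the same check.
print(lease_possible("10.64.5.101", "10.64.5.0/24"))  # True
```

Running a check like this against the DNS entry before the first PXE boot would flag the vlan mismatch without needing access to the switch configuration.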
[15:26:27] anytime its no free leases its usually a dns/vlan thing [15:26:36] just hard to diagnose without logging into switch stack =] [15:26:57] so yeah, accidental dns mismatch on setup (if i did dns, sorry about that ;) [15:27:08] i dont recall if i did, so many systems setup recently, heh. [15:27:38] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:290867|CommonSettings: cleanup temp cache file if rename fails]] (duration: 00m 30s) [15:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:23] haha [15:30:25] ja dunno either [15:30:35] (03PS1) 10Ottomata: Move druid entries into analytics vlans [dns] - 10https://gerrit.wikimedia.org/r/290954 (https://phabricator.wikimedia.org/T134275) [15:30:36] robh, look better? ^ [15:31:10] !log thcipriani@tin Synchronized php-1.28.0-wmf.3/resources/src/mediawiki.special/mediawiki.special.search.css: SWAT: [[gerrit:290710|Fix regression: text color in .mw-search-result-data (duration: 00m 27s) [15:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:31:33] ^ Krinkle sync'd [15:31:52] thcipriani: Thanks. Confirmed fix [15:32:59] (03CR) 10Ottomata: [C: 032] Move druid entries into analytics vlans [dns] - 10https://gerrit.wikimedia.org/r/290954 (https://phabricator.wikimedia.org/T134275) (owner: 10Ottomata) [15:34:39] robh, one more thing [15:34:46] i can't access druid1003.mgmt.eqiad.wmnet [15:34:48] so I can't find its mac [15:37:12] PROBLEM - Disk space on ms-be2012 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdm3 is not accessible: Input/output error [15:37:55] hmm, robh. maybe i'm not patient enough. druid101 from carbon now resolves to 10.64.5.101 [15:37:59] that's what I want [15:38:11] still getting DHCPDISCOVER from 1c:98:ec:29:e2:98 via 10.64.5.2: network 10.64.5.0/24: no free leases [15:39:36] cmjohnson1: ahhh you are here! :) [15:41:10] ottomata: huh? 
[15:41:19] the dns is wrong in git [15:41:50] ottomata: so not sure how you mean it now resolves to another ip? [15:42:04] lets focus on just one machine at a time please [15:42:12] (im itentionally ignoring the issue on 2003) [15:42:14] sorry 1003 [15:42:30] ottomata: So are you saying now druid1001 gets a lease from carbon? [15:42:35] (that shouldnt be possible) [15:43:22] ok [15:43:24] no [15:43:25] robh [15:43:25] same with 1002 [15:43:26] what? [15:43:30] the dns is wrong in git? [15:43:33] i just changed it [15:43:36] oh, ok [15:43:44] @ottomata what's up? [15:43:45] https://gerrit.wikimedia.org/r/#/c/290954/ [15:43:46] lemme repull [15:44:08] ottomata: So I'm not sure what exactly step you are on. is druid1001 now getting a lease [15:44:09] ? [15:44:13] no [15:44:14] its not [15:44:21] i changed the dns so that it is now int he analytics vlan [15:44:24] well, puppet is disabled on carbon [15:44:24] and updated it on ns0 [15:44:32] does puppet need to run after a dns change? [15:44:34] hrmm, shoudlnt matter actually [15:44:36] ja [15:44:37] so [15:44:39] from carbon [15:44:43] except the old ip is likely cached [15:44:48] dig druid1001.eqiad.wment shows my change [15:45:03] you should hop on to any of the pdns recursors in eqiad and rec_control wipe-cache druid1001.eqiad.wmnet [15:45:12] it likely has the old entries cached, so carbon doesnt know to get the new one [15:45:16] ns0,1,2? 
[15:45:32] negative [15:45:35] looking them up now [15:45:55] !log applying schema change to s3 hosts echo wikis T135699 [15:45:56] T135699: Schema changes for Echo moderation - https://phabricator.wikimedia.org/T135699 [15:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:46:09] chromium|hydrogen [15:46:21] ottomata: so hop one iether one of those machines (grepped out of site.pp) chromium|hydrogen [15:46:30] and run rec_control wipe-cache druid1001.eqiad.wmnet [15:46:33] it likely has stuff to clear out [15:46:52] Alternatively, if you walked away from it in frustration, the dns cache would expire eventually ;D [15:46:54] ok [15:46:56] ha yeah [15:47:07] ok, done, let's see what happens... [15:47:18] then reboot it into pxe and (non)profit? [15:47:35] robh, it seems to be stuck in reboot from network cycle [15:47:38] so it keeps trying [15:50:15] not sure what you mean [15:50:30] dont we wnat it network booting right now? [15:53:05] yes [15:53:06] we do [15:53:11] i mean i don't have to go in and make it do it [15:53:31] hm, robh ya still the same [15:53:35] DHCPDISCOVER from 1c:98:ec:29:e2:98 via 10.64.5.2: network 10.64.5.0/24: no free leases [15:54:09] cmjohnson1: robh is helping me with druid dns/dhcp issues [15:54:11] but, also [15:54:15] druid1003's mgmt doesn't respond [15:55:46] im looking into 1001 settings now [15:55:52] ottomata: i'll be rebooting it likely [15:55:54] k [15:55:55] np [15:55:55] 06Operations, 10Traffic, 07HTTPS, 05MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1451756 (10Danmichaelo) >>! In T105794#2314347, @bd808 wrote: > @Steinsplitter reported to me on irc that >> for protocol relative urls in mwclient, scheme='htt... [15:55:56] and will hop on its console [15:56:00] i will get out of console [15:56:07] haha, actually [15:56:09] not sure how... [15:56:11] on these [15:56:19] OH! 
[15:56:19] i got it [15:56:23] not sure what i did [15:56:24] i think esc ) [15:56:38] k i'm out [15:57:27] robh@iron:~$ host druid1001.eqiad.wmnet [15:57:27] druid1001.eqiad.wmnet has address 10.64.0.163 [15:57:37] so some dns still has the other entry [15:57:43] you ran the wipe on chromium right? [15:57:46] lemme try hydrogen [15:58:13] 06Operations, 10cassandra, 13Patch-For-Review: Assign 'c' instance IPs for restbase100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T136206#2330620 (10Eevans) a:03Dzahn [15:58:26] (03CR) 10Luke081515: [C: 031] Changetags should be granted only to sysops and bots in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290680 (https://phabricator.wikimedia.org/T136187) (owner: 10Urbanecm) [15:58:28] robh@hydrogen:~$ sudo rec_control wipe-cache druid1001.eqiad.wmnet [15:58:28] wiped 1 records, 1 negative records [15:58:51] ottomata: you did that via sudo right? [15:59:06] i had negative records on both of the eqiad recurosors (hydrogen and chromium) but now wiped [15:59:07] 06Operations, 10cassandra, 13Patch-For-Review: Assign 'c' instance IPs for restbase100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T136206#2326678 (10Eevans) 05Open>03Resolved With https://gerrit.wikimedia.org/r/290797 merged, this is now complete I think; Thanks @Dzahn ! [15:59:12] rebooting and seeing if it works now [16:00:04] godog moritzm: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160526T1600). Please do the needful. [16:00:04] Dereckson: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[16:00:24] ok, its rebooting now [16:00:26] we shall see [16:00:43] ottomata: that was likely by bad in advising, i assumed you could clear on one of the recursors and it would take effect on the other [16:00:45] (03CR) 10Luke081515: [C: 031] Enable DynamicPageList extension on te.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285009 (https://phabricator.wikimedia.org/T104163) (owner: 10Urbanecm) [16:00:51] but perhaps it doesnt, as longas you ran it as sudo [16:00:57] it should have worked. [16:01:24] ottomata: if you ahve 1002 booting, kill it so it doesnt clutter our logs [16:01:30] i see a bunch of stuff hitting [16:02:14] robh, ja [16:02:23] [@chromium:~] $ sudo rec_control wipe-cache druid1001.eqiad.wmnet [16:02:23] wiped 2 records, 0 negative records [16:02:30] it works [16:02:36] ah ok [16:02:37] druid1001 is booting now into installer [16:02:40] (patch isn't mergeable right now, we lost Wikimedia CH server for that) [16:02:43] nice! [16:02:47] so yeah, turns out you have to killthe negative cache on both recursors in a given site [16:02:49] robh doing clear on both for 1002 [16:02:56] ottomata: sorry about that! [16:03:10] so yeah, just fyi, if you were installing in codfw, its two different servers ;] [16:03:17] but just grepping site.pp for recursor will show you them [16:03:19] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [16:03:32] thcipriani twentyafterfour ^ good to merge cc akosiaris [16:03:33] we'll see if partman works [16:03:41] i disconnected from 1001 [16:03:46] ok [16:03:46] its all yours (i left in the isntaller run) [16:04:30] ottomata: learn something new daily right? So that solves the no free leases issue. =] [16:04:59] ha, ja! thank you [16:05:03] am watching installers now [16:05:09] robh, any idea about 1003's mgmt? 
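The lesson above is that the stale record had to be wiped on every recursor in the site, not just one. A minimal sketch of generating those commands follows; `rec_control wipe-cache` and the recursor names `chromium` and `hydrogen` come from the log, while the `ssh ... sudo` wrapper and the function name are assumptions for illustration. The code only builds command strings and does not run them.

```python
def wipe_cache_commands(hostname: str, recursors: list[str]) -> list[str]:
    """Build one wipe-cache command per recursor for the given hostname.

    Hypothetical helper: the ssh/sudo wrapping is an assumed invocation style.
    """
    return [f"ssh {r} sudo rec_control wipe-cache {hostname}" for r in recursors]


# The two eqiad recursors named in the log.
for cmd in wipe_cache_commands("druid1001.eqiad.wmnet", ["chromium", "hydrogen"]):
    print(cmd)
```

Generating the list from the full set of recursors avoids the case seen here, where one cache was cleared and the other kept serving the old address.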
[16:05:40] godog: \o/ should be good from my perspective as long as all the keys are correct in secrets. I can test (mwdeploy at least) if it's merged [16:07:26] thcipriani: ack, I'll merge [16:07:48] ottomata: druid1003 mgmt is working [16:07:58] (03PS30) 10Filippo Giunchedi: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [16:08:07] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [16:08:37] (03CR) 10Dereckson: "Wikimedia CH server is now up again." [puppet] - 10https://gerrit.wikimedia.org/r/286147 (https://phabricator.wikimedia.org/T56780) (owner: 10Dereckson) [16:08:38] godog: moritzm: oh http://wikimediapakistan.org/ is up again, so we can merge it now ^ [16:09:49] thanks cmjohnson1_ i'm in [16:11:15] thcipriani: I'm rearming keyholder [16:11:24] PROBLEM - Host wmf4727-test is DOWN: PING CRITICAL - Packet loss = 100% [16:11:26] ack [16:13:22] thcipriani: good to go! [16:13:29] * thcipriani tests [16:14:20] (03PS1) 10Ottomata: Add druid1003's MAC to linxu-host-entries [puppet] - 10https://gerrit.wikimedia.org/r/290962 (https://phabricator.wikimedia.org/T134275) [16:14:36] godog: could you do a service restart of keyholder-proxy ? [16:14:46] it shouldn't require you to reload keys [16:14:50] it just reloads permissions [16:14:55] PROBLEM - Keyholder SSH agent on tin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [16:15:31] ^ blerg. I think I know what that's about. [16:15:48] thcipriani: sure, restarted the proxy just now [16:15:58] could it be lagging behind? it is armed now [16:16:00] godog: perfect. Working now [16:16:15] PROBLEM - puppet last run on aqs1003 is CRITICAL: CRITICAL: puppet fail [16:16:21] no, I think it's because we're now storing public keys in /etc/keyholder.d along with private keys. 
[16:16:33] cmjohnson1_: i can't reset boot order to disk [16:16:35] it keeps reinstalling [16:16:41] set /system1/bootconfig1/bootsource5 bootorder=5 [16:16:43] 06Operations, 10Traffic, 07HTTPS, 05MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2330749 (10Steinsplitter) >>! In T105794#2330598, @Danmichaelo wrote: >>>! In T105794#2314347, @bd808 wrote: >> @Steinsplitter reported to me on irc that >>> fo... [16:16:44] error_tag=INVALID TARGET [16:16:51] and the check just makes sure that all the files in /etc/keyholder.d are in the agent [16:16:58] ottomata...for 1003? [16:17:02] druid1003? [16:17:02] no, 1001 [16:17:05] probalby for all [16:17:14] thcipriani: hah, makes sense, thanks [16:17:34] /system1/bootconfig1 [16:17:34] Targets [16:17:34] bootsource1 [16:17:34] bootsource2 [16:17:34] bootsource3 [16:17:35] bootsource4 [16:17:37] Properties [16:17:37] oemhp_bootmode=Legacy [16:17:37] oemhp_secureboot=Not Available [16:17:37] oemhp_pendingbootmode=Legacy [16:17:38] no bootsource5 [16:17:45] ottomata: it's a setting in bios [16:18:11] godog: patch coming shortly for that [16:18:16] the HP comes default setting to use there UEFI and raid controller...in bios you have to change it. I don't think I got to it for them [16:18:42] fixing now [16:19:20] cmjohnson1_: i think it has bios [16:19:22] legacy bios [16:19:26] i got there and checked [16:19:28] it did netboot [16:19:56] i'm actually confused what is the current state on 1001 now [16:20:09] hmm, i take it back [16:20:13] it did boot from hdd [16:20:21] it's not jsut legacy [16:20:29] [ 0.112862] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330) [16:20:29] ? [16:20:30] there is another SATA setting that needs to be fixed [16:20:40] mdadm: No devices listed in conf file were found. [16:20:40] Gave up waiting for root device. 
Common problems: [16:20:40] - Boot args (cat /proc/cmdline) [16:20:40] - Check rootdelay= (did the system wait long enough?) [16:20:40] - Check root= (did the system wait for the right device?) [16:20:40] - Missing modules (cat /proc/modules; ls /dev) [16:20:40] ALERT! /dev/disk/by-uuid/076f6ec1-8f05-447b-9f13-accfca1a5ec1 does not exist. Dropping to a shell! [16:20:41] hm [16:20:41] ok [16:20:41] 06Operations, 10ops-codfw, 10ops-eqiad, 10vm-requests: eqiad/codfw: 1 VM request for prometheus - https://phabricator.wikimedia.org/T136313#2330768 (10Danny_B) [16:20:50] cmjohnson1_: i will wait for you to check [16:20:50] ? [16:21:15] PROBLEM - puppet last run on sca2002 is CRITICAL: CRITICAL: puppet fail [16:21:58] can't seem to get out of initramfs [16:22:05] PROBLEM - puppet last run on mw2113 is CRITICAL: CRITICAL: Puppet has 1 failures [16:22:30] (03PS1) 10Thcipriani: Do not include public keys in keyholder check [puppet] - 10https://gerrit.wikimedia.org/r/290966 [16:22:47] ^ godog should fix keyholder [16:23:01] er, keyholder checks rather [16:23:06] thcipriani: nice, taking a look now [16:23:25] ottomata: can you log out of the vsp for 1001/1002 plz [16:23:32] trying... [16:23:34] not really sure how [16:23:35] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [16:23:46] esc ( [16:23:54] ah [16:23:55] htank you [16:23:57] out [16:24:05] of 1001 [16:24:16] and 1002 out too [16:24:22] thx [16:24:35] godog: also sca2002/aqs1003 puppet run fails are probably this change, but I'm unsure what would be changing there. 
[16:24:39] not sure what the issue is 1003 has the right setup..i know robh ran into an issue like this with another HP [16:24:42] (03CR) 10Ottomata: [C: 032] Add druid1003's MAC to linxu-host-entries [puppet] - 10https://gerrit.wikimedia.org/r/290962 (https://phabricator.wikimedia.org/T134275) (owner: 10Ottomata) [16:24:43] not sure what fixed it now [16:24:54] thcipriani: yeah I was looking at that too, Error: Could not retrieve catalog from remote server: Error 400 on SERVER: secret(): invalid secret keyholder/deploy-service.pub at /etc/puppet/modules/scap/manifests/target.pp:83 on node sca2002.codfw.wmnet [16:25:02] ha, ok [16:25:34] sorry fixed it then [16:26:04] RECOVERY - puppet last run on mw2113 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:05] godog: hmm, either the file isn't in the secret module or it's unreadable by puppetmaster looks like [16:26:13] oh cmjohnson1_ 1003 seems fine now [16:26:16] i was about to get in [16:26:18] to mgmt [16:26:34] https://github.com/wikimedia/operations-puppet/blob/production/modules/wmflib/lib/puppet/parser/functions/secret.rb#L23 [16:26:36] am ready to netboot install it too [16:26:44] so, if you could check on 1001,1002,1003 to make sure bios settings are correct [16:26:48] then i will try again [16:26:55] PROBLEM - puppet last run on aqs1001 is CRITICAL: CRITICAL: puppet fail [16:26:58] 1003 is rebooting now if you wanna login [16:26:58] (or if you can get them installed, that is good too!) [16:27:03] to vps? 
[16:27:05] uh [16:27:08] yes [16:27:13] vsp i mean [16:27:20] ok [16:27:30] I do not see anything wrong with setup [16:27:50] oh ok [16:27:52] thcipriani: yeah the name in secret is deploy_service not deploy-service [16:28:02] thought you were saying ther ewas some legacy bios thing that was not right [16:28:24] i thought the SATA AHCI wasn't set...but I did do that..so not sure [16:28:42] trying to install 1002 now to see what it says [16:28:48] AHCI SATA Controller (v0.87) :) [16:28:50] k [16:28:52] yep [16:28:54] i'm watcying watching 1003 [16:28:55] that's correct [16:29:16] PROBLEM - puppet last run on sca1002 is CRITICAL: CRITICAL: puppet fail [16:29:31] cool 1003 is netbooting [16:29:59] godog: ah, I see what's happening. In keyholder::agent the keyname has anything \W replaced with _ whereas in scap::target that's not happening. [16:30:15] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: puppet fail [16:30:21] godog: I can patch as well. Sorry :( [16:30:25] PROBLEM - puppet last run on aqs1002 is CRITICAL: CRITICAL: puppet fail [16:30:50] thcipriani: np, yeah I think that's what's happening, odd it wasn't catched before though [16:30:55] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: puppet fail [16:31:27] mmmm [16:31:28] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: secret(): invalid secret keyholder/deploy-service.pub at /etc/puppet/modules/scap/manifests/target.pp:83 on node aqs1001.eqiad.wmnet [16:31:34] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: puppet fail [16:32:07] elukey: yeah, a few lines up in the backscroll [16:32:29] godog: ahhh sorry didn't see it, thanks :) [16:32:39] 06Operations, 10ops-codfw: ms-be2012.codfw.wmnet: slot=10 dev=sdk failed - https://phabricator.wikimedia.org/T135975#2330837 (10Papaul) p:05Triage>03Normal [16:32:46] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2330840 
(10Papaul) p:05Triage>03Normal [16:33:20] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2330841 (10mobrovac) [16:33:39] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: Decommission es2005-es2010 - https://phabricator.wikimedia.org/T134755#2330842 (10Papaul) p:05Triage>03Normal [16:34:14] (03PS1) 10Thcipriani: Fix key name in scap::target [puppet] - 10https://gerrit.wikimedia.org/r/290973 [16:34:36] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service, 13Patch-For-Review: Create functional cluster checks for all services (and have them page!) - https://phabricator.wikimedia.org/T134551#2330846 (10mobrovac) I think we can consider this resolved now? [16:34:55] PROBLEM - puppet last run on sca2001 is CRITICAL: CRITICAL: puppet fail [16:35:21] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2268779 (10mobrovac) From what I can tell all but the Parsoid issues have been dealt with. Should we resolve this? [16:36:08] hmm cmjohnson1_ Installation step failed │ [16:36:08] │ An installation step failed. You can try to run the failing item │ [16:36:08] │ again from the menu, or skip it and choose something else. The │ [16:36:08] │ failing step is: Select and install software │ [16:36:08] │ [16:36:11] oook....? 
[16:36:11] thcipriani: you should change the variable also in the secret() call, looks good otherwise [16:36:20] (03PS2) 10Thcipriani: Fix key name in scap::target [puppet] - 10https://gerrit.wikimedia.org/r/290973 [16:36:28] no indication as to why [16:36:39] ottomata: that is not h/w related [16:36:43] aye [16:36:45] something wrong with partman recipe most likely [16:36:51] 06Operations, 06Analytics-Kanban, 10Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2330849 (10ema) When are we seeing those inconsistencies? Any specific timeframes? [16:36:54] ah ok [16:36:56] likely :) [16:37:15] ottomata: from the busybox installer shell, you can maybe find more details in /var/log/ , logs of the installer itself [16:37:24] i forget the exact path [16:37:26] busybox? [16:37:37] Execute a shell [16:37:37] ? [16:37:40] the shell you get when you "execute shell" from within the installer [16:37:43] yes [16:37:43] ah yes [16:37:53] godog: confirmed that patch fixes https://gerrit.wikimedia.org/r/#/c/290973/ [16:38:34] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Fix key name in scap::target [puppet] - 10https://gerrit.wikimedia.org/r/290973 (owner: 10Thcipriani) [16:38:34] ottomata: there should be one log for partman and one for the installer or so [16:38:38] thcipriani: yup thanks for the quick fix! [16:38:38] ja [16:39:22] godog: thank you for quick the merges, sorry for the rocky deploy. 
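The puppet failures above came from one code path replacing non-word characters in the key name with `_` while the other used the name unchanged. A minimal Python sketch of that normalization follows; it is an illustration of the `\W` substitution described in the log, not the Puppet code itself, and the function names are hypothetical. The `keyholder/<name>.pub` path shape is taken from the error message.

```python
import re


def normalized_key_name(name: str) -> str:
    """Replace every non-word character (anything other than [A-Za-z0-9_]) with "_"."""
    return re.sub(r"\W", "_", name)


def secret_path(name: str) -> str:
    """Build the secret path from the normalized name (hypothetical helper)."""
    return f"keyholder/{normalized_key_name(name)}.pub"


print(normalized_key_name("deploy-service"))  # deploy_service
print(secret_path("deploy-service"))          # keyholder/deploy_service.pub
```

Normalizing in one shared helper, rather than in each caller, keeps the lookup name and the stored name from drifting apart.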
[16:40:05] PROBLEM - puppet last run on scb2002 is CRITICAL: CRITICAL: puppet fail [16:40:16] thcipriani: haha that's okay, no worries [16:40:33] puppet should be recovering soon [16:40:45] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: puppet fail [16:40:48] (03PS2) 10Filippo Giunchedi: Do not include public keys in keyholder check [puppet] - 10https://gerrit.wikimedia.org/r/290966 (owner: 10Thcipriani) [16:40:49] mutante: hmm, not sure what to look for in these logs [16:40:50] but [16:40:58] the partitions/md/lvm looks right [16:41:29] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Do not include public keys in keyholder check [puppet] - 10https://gerrit.wikimedia.org/r/290966 (owner: 10Thcipriani) [16:41:46] ottomata: hmm, yea, just to find out which was the last step before it failed [16:41:55] RECOVERY - puppet last run on sca2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:42:10] like the end of the installer log then [16:42:15] hm [16:42:18] hthings like [16:42:18] May 26 16:37:23 in-target: bind9-host : Depends: libbind9-90 (= 1:9.9.5.dfsg-9+deb8u6) but it is not going to be installed [16:42:25] May 26 16:37:23 in-target: E: Unable to correct problems, you have held broken packages. [16:42:29] May 26 16:37:23 in-target: rpcbind : Depends: libtirpc1 (>= 0.2.4-2~) but it is not installable [16:42:50] ja cause the step that failed was installing packages [16:42:57] i think partman and base OS was fine [16:43:17] RECOVERY - puppet last run on aqs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:43:19] yea, that would be after partman, package intsall, yea [16:43:43] ottomata: i had that happen to me a lot one week [16:43:46] and then they went away [16:43:46] oh ja? [16:43:48] ? [16:43:48] haha [16:43:52] and i never figured out why =P [16:43:52] bind9-host .. 
havent we seen this before [16:43:58] what rob said [16:44:01] actually i think bind9-host is ok [16:44:16] i think [16:44:16] May 26 16:37:23 in-target: rpcbind : Depends: libtirpc1 (>= 0.2.4-2~) but it is not installable [16:44:16] yea, but that dependency problem there [16:44:18] is the main prob [16:44:33] those others could be related too [16:44:34] hmm ja [16:44:35] the dependency issue during installs seemed to be a transient one that wasnt actual package issues [16:44:41] but that was before and who knows [16:44:47] aye, i tcan't reach out to network maybe or something? [16:44:55] apt [16:44:55] ? [16:44:58] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [16:45:12] 06Operations, 06Analytics-Kanban, 10Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2330888 (10elukey) [16:45:18] ottomata on 1003 i see this http://p.defau.lt/?EOZ_yG38xNN2RkitoRs_nQ [16:45:22] thcipriani: looking good, mira rearmed as well [16:45:39] cmjohnson1_: on 1003? [16:45:44] i'm in installer still on 1003 [16:45:45] 06Operations, 06Analytics-Kanban, 10Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2330335 (10elukey) >>! In T136314#2330849, @ema wrote: > When are we seeing those inconsistencies? Any specific timeframes? Need to query Hive and Hadoop... [16:45:45] in a shell [16:45:51] godog: \o/ awesome! Thanks for your help! [16:46:00] sorry 1002 [16:46:27] thcipriani: np! thanks twentyafterfour too [16:46:31] H [16:46:32] ah [16:46:37] yeah i think i got that on 1001 at some point [16:46:55] dunno what makes it boot into that, but i'm guessing it is that the installer didn't finish properly [16:46:57] but it did install the os [16:47:13] ottomata: did it succesfully install any other package before thta.. or can it just not install any package .. 
is what im wondering now [16:47:26] if the latter maybe it's just network/vlan/proxy [16:47:49] to get to the apt repo [16:48:24] Dereckson: saw your patch but I have to go shortly and can't make it today :( sorry about that [16:48:33] Hmm [16:49:56] May 26 16:37:21 in-target: E: Package 'laptop-detect' has no installation candidate [16:50:09] May 26 16:37:21 debconf: --> GET mirror/http/proxy [16:50:09] May 26 16:37:21 debconf: <-- 0 http://webproxy.eqiad.wmnet:8080 [16:50:27] May 26 16:37:22 in-target: E: Unable to locate package installation-report [16:50:38] May 26 16:37:22 in-target: E: Unable to locate package popularity-contest [16:51:02] mutante: i *think* I don't see any successful post core os package installations [16:51:20] 06Operations, 10hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2330946 (10Cmjohnson) [16:51:21] hmmm.. maybe it cant talk to webproxy.eqiad from the new VLAN [16:51:22] 06Operations, 10ops-eqiad, 06DC-Ops: testing: r430 server / h800 controller / md1200 shelf - https://phabricator.wikimedia.org/T127490#2330943 (10Cmjohnson) 05Open>03Resolved Removed from puppet, salt and wiped disks. The error was mine, I for some reason didn't think it was ever installed. [16:51:50] maybe logs on webproxy.eqiad.wmnet then [16:52:22] would it say "Unable to locate package" if it really meant " i could not even ask"? [16:52:35] still unable to.. but a different kind [16:52:56] ja dunno [16:53:17] RECOVERY - puppet last run on aqs1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:54:17] ottomata: from where to where did it move.. 
network wise [16:55:19] godog: ack, no problem [16:55:28] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:55:59] mutante: it has never been installed before [16:56:03] but, it is in the analytics vlan [16:56:08] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:56:37] RECOVERY - puppet last run on sca1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:56:47] RECOVERY - puppet last run on aqs1002 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [16:57:12] ottomata: ah, ok. and other servers in analytics vlan can use that webproxy just fine i assume. how about tail -f /var/log/squid3/access.log on carbon (webproxy) while you try it again [16:58:54] hm, mutante retrying just Select and Install software doesn't show anything there [16:58:57] well, when was the last insall in analytics vlan and was it for jessie? [16:58:57] it fails pretty quickly though [16:59:14] i dont recall what sysetm i had the error on [16:59:16] hmmm [16:59:17] robh, i think the recent aqs100[456] [16:59:18] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:59:19] and those are jessie [16:59:22] oh, but those [16:59:25] are not in analytics vlan [16:59:26] hm [16:59:31] hm [16:59:49] yeah, i am wondering if this is a vlan issue for apt/security/routing or something else, no idea [17:00:02] but id say for kicks try to install trusty and see if it has the error? [17:00:04] yurik gwicke cscott arlolra subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160526T1700). [17:00:09] just to narrow down scope [17:00:13] it feels like double checking the vlan config would be good, yea [17:00:19] and what rob said .. 
[17:00:38] if it works for trusty and not jessie, then we know its likely NOT network routing policies [17:01:00] and then just a jessie config/package/soemthing related issue [17:01:05] heh, 'just' [17:01:18] RECOVERY - puppet last run on sca2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:04:39] hm [17:04:41] robh, but [17:04:44] jessie does install [17:04:53] the core os does [17:04:57] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.133, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [17:04:59] yes, but they arent the same packages [17:05:01] hm [17:05:03] ja [17:05:04] trusty/jessie [17:05:08] PROBLEM - Restbase root url on restbase1014 is CRITICAL: Connection refused [17:05:11] i'd like to veriyf that webproxy works from the shell [17:05:13] not sure how to do that [17:05:17] so we know that tftp works, but when it gets to install packages via the http proxy.. maybe not [17:05:23] so im just trying to determine if its a network security policy thing for that vlan, or a jessie package issue [17:05:34] trying to find a way... 
[17:05:37] if ubuntu apt works, then we know its not network [17:05:37] RECOVERY - puppet last run on scb2002 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:05:52] since it has no info, seemed easy enough to just locally hack out the jessie line for dhcp [17:05:54] and reinstall [17:06:14] (which would rule out the network policy for apt issue since i cannot view or cannot make sense of them off the router ;) [17:06:36] my idea may not be valid, hence i share why i suggest =] [17:06:38] yea, try the trusty install, i have a feeling it might just work [17:06:44] k will try it, one sec [17:07:01] that would narrow it down to a jessie issue if it does, well, jessie with our apt/packages [17:07:06] what's with rb1014? [17:07:10] urandom: known ^ ? [17:07:18] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [17:07:18] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [17:07:18] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [17:07:18] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [17:07:19] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [17:07:19] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [17:07:29] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [17:07:29] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [17:07:38] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [17:07:38] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [17:07:58] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is 
currently enabled, last run 23 seconds ago with 0 failures [17:08:08] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [17:08:08] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [17:08:23] uh [17:08:26] what is this [17:08:28] Was that expected? [17:08:32] robh: local hack on carbon? [17:08:32] nope [17:08:38] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [17:08:48] RECOVERY - Keyholder SSH agent on tin is OK: OK: Keyholder is armed with all configured keys. [17:08:48] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [17:08:48] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [17:08:57] ottomata: halt puppet, open the /etc/dhcp/linux.hosts.blah and remove the two lines for jessie for the system you are installing [17:08:58] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [17:08:58] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [17:09:05] then reboot it into ubuntu, once installer loads, reenable puppet [17:09:08] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [17:09:08] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [17:09:18] (easlier than putting in a new patch just for a single reboot) [17:09:51] k [17:10:13] I think mysql crashed [17:10:23] and that is really bad news [17:10:25] (03PS1) 10Faidon Liambotis: Create raid module to hold RAID monitoring checks [puppet] - 10https://gerrit.wikimedia.org/r/290986 (https://phabricator.wikimedia.org/T84050) [17:10:26] cmjohnson1_: i still don't understand the boot order bootsource5 thing [17:10:27] (03PS1) 10Faidon Liambotis: raid: add HP's 
RAID tool to the list [puppet] - https://gerrit.wikimedia.org/r/290987 (https://phabricator.wikimedia.org/T97998) [17:10:29] (PS1) Faidon Liambotis: raid: add a new "raid" fact [puppet] - https://gerrit.wikimedia.org/r/290988 (https://phabricator.wikimedia.org/T84050) [17:10:34] status=2 [17:10:34] status_tag=COMMAND PROCESSING FAILED [17:10:34] error_tag=INVALID TARGET [17:10:37] there is no bootsource5 on these [17:10:40] for a tokudb host [17:10:47] mobrovac: nothing i am doing, no [17:10:49] godog, volans: ^^^ please review [17:10:51] (more will follow) [17:11:14] probably OOM [17:11:53] mobrovac: there is an instance bootstrapping there, but it hasn't started listening for connections yet [17:12:01] ottomata: I don't know either [17:12:27] (CR) Dzahn: [C: 1] Apache: redirect pk.wikimedia.org to wikimediapakistan.org [puppet] - https://gerrit.wikimedia.org/r/286147 (https://phabricator.wikimedia.org/T56780) (owner: Dereckson) [17:12:30] (CR) jenkins-bot: [V: -1] Create raid module to hold RAID monitoring checks [puppet] - https://gerrit.wikimedia.org/r/290986 (https://phabricator.wikimedia.org/T84050) (owner: Faidon Liambotis) [17:12:37] (CR) jenkins-bot: [V: -1] raid: add a new "raid" fact [puppet] - https://gerrit.wikimedia.org/r/290988 (https://phabricator.wikimedia.org/T84050) (owner: Faidon Liambotis) [17:12:42] blergh [17:13:33] (PS2) Faidon Liambotis: Create raid module to hold RAID monitoring checks [puppet] - https://gerrit.wikimedia.org/r/290986 (https://phabricator.wikimedia.org/T84050) [17:13:35] (PS2) Faidon Liambotis: raid: add HP's RAID tool to the list [puppet] - https://gerrit.wikimedia.org/r/290987 (https://phabricator.wikimedia.org/T97998) [17:13:37] (PS2) Faidon Liambotis: raid: add a new "raid" fact [puppet] - https://gerrit.wikimedia.org/r/290988 (https://phabricator.wikimedia.org/T84050) [17:16:01] aghhh [17:16:06] cmjohnson1_: my internet just died [17:16:08] lost my connection to 
the vsp [17:16:10] (03CR) 10jenkins-bot: [V: 04-1] raid: add a new "raid" fact [puppet] - 10https://gerrit.wikimedia.org/r/290988 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [17:16:15] Virtual Serial Port is currently in use by another session. [17:16:18] how do I clear it? [17:16:31] on an hp? [17:16:32] oh you have nice docs... [17:16:36] somebody seemed to have stopped rb there urandom [17:16:36] stop /system1/oemhp_vsp1 [17:16:36] atp Hash Sum mismatch, nice [17:16:37] :) [17:16:37] wth? [17:17:02] (03CR) 10Dzahn: "disregard that, i didn't look at the netmask right." [puppet] - 10https://gerrit.wikimedia.org/r/290348 (owner: 10Alex Monk) [17:17:52] ottomata1: racadm racreset [17:17:57] mutante: this is a hp [17:18:05] i found it though [17:18:06] stop /system1/oemhp_vsp1 [17:18:06] ottomata1: just realized, nvm [17:18:09] :) [17:18:13] good to know [17:18:58] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [17:19:08] RECOVERY - Restbase root url on restbase1014 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.027 second response time [17:19:18] (03PS2) 10Dzahn: scap: add labtestwikitech to mediawiki-installation group [puppet] - 10https://gerrit.wikimedia.org/r/290348 (owner: 10Alex Monk) [17:20:14] jouncebot: next [17:20:14] In 1 hour(s) and 39 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160526T1900) [17:21:13] mobrovac: sorry, yeah, just came to the same conclusion [17:21:18] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [17:21:34] mobrovac: that it looked like it was just shutdown [17:21:38] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [17:21:58] PROBLEM - MariaDB Slave Lag: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [17:21:59] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag 
could not connect [17:22:07] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [17:22:28] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [17:22:28] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [17:22:41] yes, yes, we knew with the first time you told us [17:22:48] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [17:22:58] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [17:24:11] we'll see if you come back [17:24:30] (03PS3) 10Faidon Liambotis: raid: add a new "raid" fact [puppet] - 10https://gerrit.wikimedia.org/r/290988 (https://phabricator.wikimedia.org/T84050) [17:25:15] (03PS4) 10Faidon Liambotis: raid: add a new "raid" fact [puppet] - 10https://gerrit.wikimedia.org/r/290988 (https://phabricator.wikimedia.org/T84050) [17:26:08] 06Operations, 06Analytics-Kanban, 10Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2330335 (10Nuria) 1. On labs or perhaps prod: Generate lots of request and a sighup and see if all requests ids are present, try to find repro for dropping... [17:27:03] (03PS5) 10Faidon Liambotis: raid: add a new "raid" fact [puppet] - 10https://gerrit.wikimedia.org/r/290988 (https://phabricator.wikimedia.org/T84050) [17:28:32] 06Operations, 06Analytics-Kanban, 10Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2331150 (10Ottomata) https://gist.github.com/ottomata/7048012 [17:30:18] jynus: needs help? 
[17:30:38] nah, now that it crashed, I am upgrading and restarting it [17:32:11] problem is in what state is after restart [17:32:16] mutante: robh, fyi, ubuntu worked [17:32:30] so its an issue with jessie specifically, and not our network [17:32:35] progress [17:32:57] (03CR) 10jenkins-bot: [V: 04-1] raid: add HP's RAID tool to the list [puppet] - 10https://gerrit.wikimedia.org/r/290987 (https://phabricator.wikimedia.org/T97998) (owner: 10Faidon Liambotis) [17:33:04] volans, on the good side, we know what caused es2019/es2017 crashes [17:33:33] lol wtf jenkins [17:33:35] 18 minutes? [17:33:37] robh yay progress [17:33:46] for something that doesn't look like an error anyway [17:33:49] ottomata1: oh! right, so installer issue with HP .. _again_ :/ [17:33:55] jynus: yay, saw the emails [17:33:58] (03CR) 10Faidon Liambotis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/290987 (https://phabricator.wikimedia.org/T97998) (owner: 10Faidon Liambotis) [17:36:24] mutante: its HP jessie issue? [17:37:15] !log starting mobileapps deployment [17:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:37:25] (03CR) 10Dereckson: "Ping?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288582 (https://phabricator.wikimedia.org/T135212) (owner: 10Lokal Profil) [17:37:33] hey akosiaris, did you disable puppet on carbon? [17:37:36] it was disabled a bit ago [17:37:38] and i disabled it too [17:37:43] but not sure if someone reenabled in between [17:37:47] so not sure if i should reenable it [17:37:58] ottomata: i just say that because we know it doesnt happen with trusty and it's HP hardware and we had an installer bug before [17:38:30] sigh, ok [17:38:42] what should I do? 
[17:39:13] ottomata: create a ticket and paste the error from earlier with the package dependency issue [17:39:42] Operations, cassandra: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2331203 (Eevans) [17:39:42] let me look for an older one [17:39:43] mutante: tags? [17:39:57] ottomata: just "operations" i guess [17:40:10] k [17:40:56] (CR) Dzahn: [C: 2] scap: add labtestwikitech to mediawiki-installation group [puppet] - https://gerrit.wikimedia.org/r/290348 (owner: Alex Monk) [17:41:05] Operations, cassandra, procurement: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2331215 (faidon) p:Triage>Normal [17:41:51] (PS6) Faidon Liambotis: raid: add a new "raid" fact [puppet] - https://gerrit.wikimedia.org/r/290988 (https://phabricator.wikimedia.org/T84050) [17:42:55] ottomata: yeah and i'll add in my findings as well if i find the old host i had issue with last week [17:43:24] Operations: Jessie install on HP Fails - https://phabricator.wikimedia.org/T136341#2331222 (Ottomata) [17:43:35] Operations: Jessie install on HP Fails - https://phabricator.wikimedia.org/T136341#2331237 (Ottomata) [17:43:37] Operations, ops-eqiad, Patch-For-Review: rack/setup/deploy 3 eqiad druid nodes - https://phabricator.wikimedia.org/T134275#2260510 (Ottomata) [17:43:43] ok robh [17:43:52] https://phabricator.wikimedia.org/T136341 [17:43:54] there it is [17:44:17] Operations, ops-eqiad, Patch-For-Review: rack/setup/deploy 3 eqiad druid nodes - https://phabricator.wikimedia.org/T134275#2260510 (Ottomata) Currently stuck on T136341. :( [17:44:17] i tried but couldnt find a HP specific one yet [17:45:41] why would that be HP specific? 
[17:46:14] why would a dpkg error about bind9-host be HP specific, seriously [17:47:35] ottomata: try reinstalling that system [17:47:51] just had vague memories of another install issue we had in the past with the HP servers [17:49:42] paravoid: ok [17:49:50] wait, uhhh [17:49:52] druid1001 installed! [17:49:56] just looked back at a screen [17:49:59] ooooooook [17:50:10] I updated d-i, not sure if there was any change [17:50:19] try reinstalling the one that failed [17:50:40] Operations: Set jessie as the default os installer on network boot and manually mark other versions (precise, trusty) - https://phabricator.wikimedia.org/T133539#2234934 (Dzahn) Recently looked.. there are many jessie but this switch really starts to make sense once the appservers are switching now because t... [17:51:17] ottomata: try it more than once to make sure! ;] [17:51:48] (PS1) Faidon Liambotis: raid: vary package installation on the RAID installed [puppet] - https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) [17:52:32] uhhh druid1003 also installed jessie, i think while i wasn't looking...? [17:52:40] yeah i mean we should reinstall and watch it [17:52:49] ok, well, 1002 needs to go [17:52:51] so doing that now [17:52:53] cool [17:52:57] paravoid: thank you! [17:53:30] it was a transient issue before when i had the package conflict messages similar to this [17:53:38] in that i had it one evening, and by the next day i did not. [17:54:11] and trying to find something in every phab task I touch for setups has proven fruitless =P [17:54:18] (so no clue what system it was) [17:56:04] jynus: I had just pulled a whole table out of analytics-store with sqoop a half hour or so before it crashed [17:56:25] !log mobileapps finished deploying 5ce4f31 (n.b. 
last deployment, on 23 May, appears to have re-deployed b8c396a) [17:56:29] just letting you know because when you bring it back up I was planning on pulling some more (larger) tables [17:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:58:23] well, it crashed because out of memory error [17:59:24] k, then if it crashes again with oom after I sqoop out of it, we'll know it was me :/ [17:59:25] we will see why soon [17:59:38] milimetric, which tables? [17:59:58] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave [17:59:58] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave [18:00:03] I grabbed simplewiki.logging around 16:00 UTC [18:00:09] nah, not you [18:00:16] k [18:00:30] RECOVERY - MariaDB Slave Lag: m2 on dbstore1002 is OK: OK slave_sql_lag not a slave [18:01:01] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for Pcoombe - https://phabricator.wikimedia.org/T136343#2331296 (10Pcoombe) [18:01:38] milimetric, but please wait until I update on the email [18:01:50] if you start querying it will make the recovery slower [18:04:11] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:04:12] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:04:12] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [18:04:12] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [18:04:31] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [18:04:31] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:04:42] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:04:42] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state 
Slave_IO_Running: Yes [18:04:52] RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [18:04:53] looks better than I thought [18:05:11] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [18:05:20] I think only x1 is broken, and it is only 200GB [18:05:23] RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [18:05:32] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:05:42] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:05:42] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [18:05:52] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:05:52] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [18:06:12] PROBLEM - puppet last run on mw2115 is CRITICAL: CRITICAL: puppet fail [18:07:12] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2331306 (10GWicke) Most issues have indeed been addressed, and most of the remaining ones are also well underway. I agree that this task is no lo... 
[18:09:23] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.13 seconds [18:09:46] (03CR) 10jenkins-bot: [V: 04-1] raid: vary package installation on the RAID installed [puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [18:14:22] hmm, cmjohnson1 robh, druid1002 seems to be different [18:14:31] the install looks like it completed properly [18:14:32] then it rebooted [18:14:36] but couldn't [18:14:38] [ 0.113889] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330) [18:14:48] mdadm: No devices listed in conf file were found. [18:14:48] Gave up waiting for root device. Common problems: [18:14:59] oh, disks didnt detect in time [18:15:02] there is a related task for that [18:15:04] lemme find [18:15:15] we've seen that on a number of jessie isntalls [18:15:22] ottomata: you'll want to reference your issue on it as well [18:15:35] https://phabricator.wikimedia.org/T131961 [18:15:49] try just a soft reset to see if it boots [18:16:01] though the corrupted hw-pmu is new [18:16:08] the root device error sounds related. [18:17:47] soft reset? [18:18:09] milimetric, things look more or less good, but I would suggest wait for a day for doing heavy queries as now they may be slower than usual [18:18:42] yep, I saw your email. Hm... 
some of this stuff is time sensitive, I will try one table smaller than the one I grabbed before and see if things go well [18:18:50] (I don't need fresh results) [18:18:54] ok did power reset [18:18:56] trying to boot [18:19:42] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 103.05 seconds [18:20:46] yup, robh, it booted [18:20:49] this tiem [18:20:53] still printed [ 0.113817] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330) [18:21:03] so i dunno what that is at all [18:21:17] you may wanna make a task so we try to figure it out, but if its not blocking you, it can be lower priority [18:21:38] its not blocking me [18:21:41] hm [18:22:07] http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=mmr_kc-0126190 [18:22:13] so yeah, its the power settings [18:22:31] solution is there [18:22:32] ok will make ticket [18:22:41] but its not a big deal, we should figure out the setting so make a task [18:22:49] and i'll chase it down and update docs later, but it doesnt stop anything [18:22:52] k [18:23:01] its just the kernel thinks it should be able to control power settings and its not being allowed to [18:23:32] RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 283.39 seconds [18:23:48] robh https://phabricator.wikimedia.org/T136345 [18:23:49] 06Operations: HP Warning on boot [Firmware Bug]: the BIOS has corrupted hw-PMU resources - https://phabricator.wikimedia.org/T136345#2331341 (10Ottomata) [18:24:13] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.35 seconds [18:26:11] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:26:21] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.29 seconds [18:28:16] 06Operations, 10Traffic, 13Patch-For-Review, 07Performance: Lots of Title::purgeExpiredRestriction from API DELETE FROM 
`page_restrictions` WHERE (pr_expiry < '20160517063108') without batching/throttling potentially causing lag on s5-api - https://phabricator.wikimedia.org/T135470#2331377 (aaron) Ope... [18:28:18] Operations, DBA: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100#2331379 (aaron) [18:28:31] (PS1) Ottomata: Fix lvname for druid volume in druid-4ssd-raid10.cfg [puppet] - https://gerrit.wikimedia.org/r/291004 [18:28:59] (CR) Ottomata: [C: 2 V: 2] Fix lvname for druid volume in druid-4ssd-raid10.cfg [puppet] - https://gerrit.wikimedia.org/r/291004 (owner: Ottomata) [18:31:20] (PS2) Dzahn: Apache: redirect pk.wikimedia.org to wikimediapakistan.org [puppet] - https://gerrit.wikimedia.org/r/286147 (https://phabricator.wikimedia.org/T56780) (owner: Dereckson) [18:31:41] Operations: Jessie install on HP Fails - https://phabricator.wikimedia.org/T136341#2331413 (Ottomata) Open>Invalid Dunno what was up, but this problem went away. [18:31:43] Operations, ops-eqiad, Patch-For-Review: rack/setup/deploy 3 eqiad druid nodes - https://phabricator.wikimedia.org/T134275#2331415 (Ottomata) [18:31:59] Operations, ops-eqiad, Patch-For-Review: rack/setup/deploy 3 eqiad druid nodes - https://phabricator.wikimedia.org/T134275#2260510 (Ottomata) a:Cmjohnson>Ottomata [18:32:18] Operations, ops-eqiad, Patch-For-Review: rack/setup/deploy 3 eqiad druid nodes - https://phabricator.wikimedia.org/T134275#2260510 (Ottomata) Servers are installed! 
[18:34:42] (03CR) 10Dzahn: [C: 032] "checked with apache-fast-test, mw1017 canary, was on swat window already" [puppet] - 10https://gerrit.wikimedia.org/r/286147 (https://phabricator.wikimedia.org/T56780) (owner: 10Dereckson) [18:35:42] RECOVERY - puppet last run on mw2115 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:46:32] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 247.60 seconds [18:47:31] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 80.00 seconds [18:50:47] (03PS3) 10Faidon Liambotis: Create raid module to hold RAID monitoring checks [puppet] - 10https://gerrit.wikimedia.org/r/290986 (https://phabricator.wikimedia.org/T84050) [18:50:49] (03PS3) 10Faidon Liambotis: raid: add HP's RAID tool to the list [puppet] - 10https://gerrit.wikimedia.org/r/290987 (https://phabricator.wikimedia.org/T97998) [18:50:51] (03PS7) 10Faidon Liambotis: raid: add a new "raid" fact [puppet] - 10https://gerrit.wikimedia.org/r/290988 (https://phabricator.wikimedia.org/T84050) [18:50:53] (03PS2) 10Faidon Liambotis: raid: vary package installation on the RAID installed [puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) [18:50:55] (03PS1) 10Faidon Liambotis: raid: slightly change check-raid's "utility" names [puppet] - 10https://gerrit.wikimedia.org/r/291011 [18:50:57] (03PS1) 10Faidon Liambotis: raid: move check-raid.py into /usr/local/lib/nagios/plugins [puppet] - 10https://gerrit.wikimedia.org/r/291012 [18:50:59] (03PS1) 10Faidon Liambotis: raid: setup multiple checks, one per each RAID found [puppet] - 10https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) [18:51:01] (03PS1) 10Faidon Liambotis: raid: add monitoring for HP controllers [puppet] - 10https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998) [18:51:38] ...and with that, ttyl :) [18:51:51] godog, volans, 
jynus, akosiaris: ^^^ :) [18:54:23] (03CR) 10jenkins-bot: [V: 04-1] raid: slightly change check-raid's "utility" names [puppet] - 10https://gerrit.wikimedia.org/r/291011 (owner: 10Faidon Liambotis) [18:55:57] (03CR) 10jenkins-bot: [V: 04-1] raid: setup multiple checks, one per each RAID found [puppet] - 10https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [19:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160526T1900). [19:03:29] paravoid: great, thanks, I will take a look [19:04:06] (03PS2) 10Dzahn: udp2log: move icinga checks from ./files/ to module [puppet] - 10https://gerrit.wikimedia.org/r/290871 [19:04:18] (03CR) 10Dzahn: [C: 032] "no-op http://puppet-compiler.wmflabs.org/2932/" [puppet] - 10https://gerrit.wikimedia.org/r/290871 (owner: 10Dzahn) [19:04:19] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 178.09 seconds [19:08:27] (03CR) 10jenkins-bot: [V: 04-1] raid: vary package installation on the RAID installed [puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [19:10:33] (03CR) 10jenkins-bot: [V: 04-1] raid: add HP's RAID tool to the list [puppet] - 10https://gerrit.wikimedia.org/r/290987 (https://phabricator.wikimedia.org/T97998) (owner: 10Faidon Liambotis) [19:11:46] (03PS2) 10Dzahn: udp2log: mv rolematcher.py PacketLossLogtailer.py to module [puppet] - 10https://gerrit.wikimedia.org/r/290872 [19:25:40] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2331636 (10Papaul) Will be receiving memory replacement tomorrow Service Request 930250087 <<#7521282-32655863#>> Service Request 930256880 <<#7521282-32654588#>> [19:27:57] (03CR) 10jenkins-bot: [V: 04-1] udp2log: mv rolematcher.py PacketLossLogtailer.py to module 
[puppet] - 10https://gerrit.wikimedia.org/r/290872 (owner: 10Dzahn) [19:28:21] dear jouncebot: :P [19:28:43] !log mwscript initSiteStats.php --wiki fowiki --update (T136353) [19:28:45] T136353: Reset statistics for fo.wikipedia - https://phabricator.wikimedia.org/T136353 [19:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:39] !log aaron@tin Synchronized php-1.28.0-wmf.3/includes/api/ApiStashEdit.php: 9a9ec26d25 (duration: 00m 24s) [19:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:02] twentyafterfour are you deploying soon? [19:32:41] I'm around just a bit in unlikely case of a problem with our code [19:33:06] audephone: yes going to push wmf.3 now. Thank you! [19:33:13] Okay [19:34:25] (03PS1) 1020after4: all wikis to 1.28.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291020 [19:35:22] (03CR) 1020after4: [C: 032] all wikis to 1.28.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291020 (owner: 1020after4) [19:36:09] (03Merged) 10jenkins-bot: all wikis to 1.28.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291020 (owner: 1020after4) [19:40:44] twentyafterfour: any issues with the new labtestwikitech in dsh group? [19:41:02] robh: just about to find out [19:41:13] heh, cool, im standing by to depool it in dsh if it does [19:41:18] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.3 [19:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:41:34] all good [19:41:57] robh: no problems with that. do I need to explicitly sync to make sure it's got everything up to date? [19:42:34] im not entirely certain, i was just asked to be around in case it scerwed up! 
its testing wikitech stuff is my understanding [19:42:38] :) [19:43:00] I am hanging out a bit more [19:43:17] twentyafterfour: mainly if it shows issues during your syncs we depool and let them sort it out later =] [19:43:33] since its in itself a test box, im not sure we should spend time toubleshooting it. [19:43:38] troubleshooting even. [19:43:44] damn I cannot type today =P [19:44:03] I'm seeing a lot of Unknown namespace ID: 108 [19:44:19] in search... [19:44:23] I don't know what it is [19:44:38] Probably ask erikb [19:48:32] audephone: I brought it up in #wikimedia-discovery [19:49:04] Ok [19:50:10] so no issues iwth the new labtestwikitech though [19:50:11] ? [19:50:15] so I'm probably gonna have to roll back at least itwiki, full text search is failing there [19:50:24] robh: none with scap syncing [19:50:27] cool [19:50:35] that was the concern, so glad it didnt happen =] [19:50:42] (not cool for your other issues, those stink) [19:51:02] robh :) .. just problems with the wmf.3 branch, nothing for you to worry about I think [19:51:19] I don't think worry entered into it [19:51:26] * robh has been noming lunch this entire time [19:51:53] well, half a lunch, i ran out of foodstuffs and need to go grocery shopping this evening. [19:52:18] in fact, since there wasnt an issue, im going to run down the street for something else to eat, back shortly. [19:52:51] (03PS1) 10Hashar: apache: logrotate augeas rule needs apache2 package [puppet] - 10https://gerrit.wikimedia.org/r/291024 [19:53:42] cool [19:54:40] audephone: Notice: Undefined index: entities in /srv/mediawiki/php-1.28.0-wmf.3/extensions/Wikidata/extensions/ArticlePlaceholder/includes/SearchHookHandler.php on line 243 [19:54:49] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). 
[19:56:25] Twentyafterfour known and we have a fix [19:56:33] I can deploy it tomorrow [19:57:25] audephone: ok cool thank you [19:57:44] Thanks for poking me [19:58:49] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [20:03:51] 06Operations, 06Discovery, 10Maps, 10Tilerator, 03Discovery-Maps-Sprint: water_polygons import is broken - https://phabricator.wikimedia.org/T112831#2331757 (10MaxSem) 05Open>03Resolved a:03MaxSem Nah, no more recurrences. [20:04:43] (03PS10) 10Ottomata: Initial debian packaging [debs/druid] - 10https://gerrit.wikimedia.org/r/287285 (https://phabricator.wikimedia.org/T134503) [20:05:16] (03PS2) 10Ottomata: Update Kafka analytics broker list for deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287741 [20:13:46] akosiaris: yt? [20:13:52] got some more reprepro updates qs [20:16:48] (03PS1) 10Yuvipanda: tools; Use only one uwsgi worker for toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/291038 [20:17:05] andrewbogott: chasemp ^ should solidify checker against more false positives too [20:17:21] paravoid: yt? [20:17:24] and is a pre-req for the webservice job working reliably I Think [20:17:32] (03PS2) 10Yuvipanda: tools: Use only one uwsgi worker for toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/291038 [20:17:48] can it still get all the tests done that need doing with one worker? [20:18:24] (03CR) 10Hashar: [V: 031] "That occurs when including contint::localhost_worker and is purely an ordering issue." 
[puppet] - 10https://gerrit.wikimedia.org/r/291024 (owner: 10Hashar) [20:18:26] andrewbogott: yup, it'll just block [20:18:44] andrewbogott: it'll just serialize access [20:18:50] (03PS2) 10Hashar: apache: logrotate augeas rule needs apache2 package [puppet] - 10https://gerrit.wikimedia.org/r/291024 (https://phabricator.wikimedia.org/T136301) [20:18:51] ok [20:19:05] (03CR) 10Andrew Bogott: [C: 031] tools: Use only one uwsgi worker for toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/291038 (owner: 10Yuvipanda) [20:21:41] (03CR) 10Paladox: [C: 031] apache: logrotate augeas rule needs apache2 package [puppet] - 10https://gerrit.wikimedia.org/r/291024 (https://phabricator.wikimedia.org/T136301) (owner: 10Hashar) [20:24:08] (03CR) 10Ottomata: [C: 032 V: 032] Initial debian packaging [debs/druid] - 10https://gerrit.wikimedia.org/r/287285 (https://phabricator.wikimedia.org/T134503) (owner: 10Ottomata) [20:25:40] (03PS3) 10Yuvipanda: tools: Use only one uwsgi worker for toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/291038 [20:30:05] (03PS4) 10Yuvipanda: tools: Use only one uwsgi worker for toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/291038 (https://phabricator.wikimedia.org/T136347) [20:30:25] (03PS5) 10Yuvipanda: tools: Use only one uwsgi worker for toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/291038 (https://phabricator.wikimedia.org/T136347) [20:31:38] (03CR) 10Yuvipanda: [C: 032] tools: Use only one uwsgi worker for toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/291038 (https://phabricator.wikimedia.org/T136347) (owner: 10Yuvipanda) [20:33:24] (03PS1) 10Ottomata: Include cloudera reprepro updates in jessie-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/291043 (https://phabricator.wikimedia.org/T131974) [20:34:13] 06Operations, 10Traffic, 13Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2331912 (10BBlack) So far cp3048 seems slightly better off 
with the new params at 156G virt + 79G resident, but it will take days to level into a new normal (not recorded a... [20:36:03] (03CR) 10Ottomata: [C: 032] Include cloudera reprepro updates in jessie-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/291043 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata) [20:38:04] ottomata: I merged youto [20:38:05] !log twentyafterfour@tin Synchronized php-1.28.0-wmf.3/includes/specials/SpecialSearch.php: deploy hotfix for itwiki search T136356 (duration: 00m 23s) [20:38:06] T136356: itwiki full text search: Unknown namespace ID: 108 - https://phabricator.wikimedia.org/T136356 [20:38:11] danke YuviPanda [20:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:40:29] "Warning: tag is a metaparam; this value will inherit to all contained resources in the toollabs::kubebuilder definition [20:40:45] i dunno what that is about yet [20:41:06] (03PS1) 10BBlack: VCL: lower TTL caps from 14 to 7 days [puppet] - 10https://gerrit.wikimedia.org/r/291059 (https://phabricator.wikimedia.org/T124954) [20:41:30] (03PS2) 10BBlack: VCL: lower TTL caps from 14 to 7 days [puppet] - 10https://gerrit.wikimedia.org/r/291059 (https://phabricator.wikimedia.org/T124954) [20:41:45] (03PS14) 10Ottomata: Druid module and analytics_cluster role class [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) [20:42:00] (03CR) 10BBlack: [C: 032 V: 032] VCL: lower TTL caps from 14 to 7 days [puppet] - 10https://gerrit.wikimedia.org/r/291059 (https://phabricator.wikimedia.org/T124954) (owner: 10BBlack) [20:43:22] (03PS3) 10Dzahn: udp2log: mv rolematcher.py PacketLossLogtailer.py to module [puppet] - 10https://gerrit.wikimedia.org/r/290872 [20:44:35] (03Abandoned) 10Yuvipanda: Add dh-python to build dependencies [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/290611 (owner: 10Yuvipanda) [20:47:48] (03PS15) 10Ottomata: Druid module and analytics_cluster 
role class [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) [20:47:50] (03CR) 10Lokal Profil: "Just checking that the ping wasn't meant for me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288582 (https://phabricator.wikimedia.org/T135212) (owner: 10Lokal Profil) [20:48:18] (03PS16) 10Ottomata: Druid module and analytics_cluster role class [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) [20:48:39] so jenkins-bot did it again, fail on pplint-HEAD, rebase, fixed [20:49:06] (03CR) 10Ottomata: [C: 032 V: 032] "Not yet applied anywhere, so safe to merge." [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata) [20:50:42] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/291099/ is the proper fix for the search and watchlist errors, I can deploy it once it merges [20:51:03] (03CR) 10Yuvipanda: [C: 031] "I've fixed the underlying tests for this now, and it works fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/290681 (https://phabricator.wikimedia.org/T136162) (owner: 10Rush) [20:54:08] (03PS4) 10Dzahn: udp2log: mv rolematcher.py PacketLossLogtailer.py to module [puppet] - 10https://gerrit.wikimedia.org/r/290872 [20:54:40] (03CR) 10Dzahn: [C: 032] "no-op http://puppet-compiler.wmflabs.org/2933/" [puppet] - 10https://gerrit.wikimedia.org/r/290872 (owner: 10Dzahn) [21:01:53] legoktm: cool [21:03:42] 06Operations: Apt mirror for Ubuntu Trusty hash sum mismatch - https://phabricator.wikimedia.org/T136307#2332011 (10hashar) [21:12:20] !log running update-ubuntu-mirror on carbon to check for T136307 [21:12:21] T136307: Apt mirror for Ubuntu Trusty hash sum mismatch - https://phabricator.wikimedia.org/T136307 [21:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:13:14] 06Operations: Apt mirror for Ubuntu Trusty hash sum mismatch - https://phabricator.wikimedia.org/T136307#2332025 (10hashar) 05Open>03Resolved a:03hashar Transient issue. it is gone now :) [21:13:59] 06Operations: Apt mirror for Ubuntu Trusty hash sum mismatch - https://phabricator.wikimedia.org/T136307#2330101 (10Dzahn) The cron entry is there, can see in syslog that it runs ... no errors in error.log now running it manually .. [21:19:19] (03PS2) 10Dzahn: move/copy ubuntu-cloud.key into openstack/swift modules [puppet] - 10https://gerrit.wikimedia.org/r/290874 [21:20:18] RECOVERY - cassandra-b CQL 10.192.48.47:9042 on restbase2005 is OK: TCP OK - 0.038 second response time on port 9042 [21:21:32] 06Operations: Apt mirror for Ubuntu Trusty hash sum mismatch - https://phabricator.wikimedia.org/T136307#2332055 (10Dzahn) @hashar sum mismatch might be when the webproxy failed temp i guess. I just synced it manually and it finished without problems. That doesnt change the status reported on launchpad.net thou... 
[21:24:14] (03PS3) 10Dzahn: move/copy ubuntu-cloud.key into openstack/swift modules [puppet] - 10https://gerrit.wikimedia.org/r/290874 [21:25:00] (03PS2) 10Dzahn: varnish: move errorpage.html from misc to module [puppet] - 10https://gerrit.wikimedia.org/r/290876 [21:36:58] (03Abandoned) 10EBernhardson: Change elasticsearch disk critical from 15% to 13% [puppet] - 10https://gerrit.wikimedia.org/r/290481 (owner: 10EBernhardson) [21:38:29] jouncebot: next [21:38:29] In 1 hour(s) and 21 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160526T2300) [21:42:00] (03CR) 10jenkins-bot: [V: 04-1] varnish: move errorpage.html from misc to module [puppet] - 10https://gerrit.wikimedia.org/r/290876 (owner: 10Dzahn) [21:43:48] !log legoktm@tin Synchronized php-1.28.0-wmf.3/includes/title/MediaWikiTitleCodec.php: TitleParser: In formatTitle(), don't throw exceptions on bad namespaces - T136352, T136356 (duration: 00m 28s) [21:43:49] T136352: Special:EditWatchlist is broken - https://phabricator.wikimedia.org/T136352 [21:43:49] T136356: itwiki full text search: Unknown namespace ID: 108 - https://phabricator.wikimedia.org/T136356 [21:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:49:02] !log legoktm@tin Synchronized php-1.28.0-wmf.3/includes/api/ApiStashEdit.php: Bail out in ApiStashEdit for bots for sanity (duration: 00m 25s) [21:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:50:51] !log legoktm@tin Synchronized php-1.28.0-wmf.2/includes/api/ApiStashEdit.php: Bail out in ApiStashEdit for bots for sanity (duration: 00m 24s) [21:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:54:14] (03PS2) 10Dzahn: nagios: move check_command/config to own file [puppet] - 10https://gerrit.wikimedia.org/r/290877 [21:54:26] (03CR) 10Dzahn: [C: 032] "no-op http://puppet-compiler.wmflabs.org/2934/" [puppet] - 
10https://gerrit.wikimedia.org/r/290877 (owner: 10Dzahn) [21:59:45] !log ori@tin Synchronized php-1.28.0-wmf.3/includes/api/ApiStashEdit.php: 8521b7b069: Send edit stash metrics for cache attempts (duration: 00m 25s) [21:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:03:22] (03PS1) 10Ottomata: Apply druid roles in production with initial (guesswork) configuration [puppet] - 10https://gerrit.wikimedia.org/r/291113 (https://phabricator.wikimedia.org/T131974) [22:05:50] (03PS2) 10Ottomata: Apply druid roles in production with initial (guesswork) configuration [puppet] - 10https://gerrit.wikimedia.org/r/291113 (https://phabricator.wikimedia.org/T131974) [22:08:33] (03PS3) 10Ottomata: Apply druid roles in production with initial (guesswork) configuration [puppet] - 10https://gerrit.wikimedia.org/r/291113 (https://phabricator.wikimedia.org/T131974) [22:08:46] (03CR) 10jenkins-bot: [V: 04-1] Apply druid roles in production with initial (guesswork) configuration [puppet] - 10https://gerrit.wikimedia.org/r/291113 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata) [22:09:40] (03PS4) 10Ottomata: Apply druid roles in production with initial (guesswork) configuration [puppet] - 10https://gerrit.wikimedia.org/r/291113 (https://phabricator.wikimedia.org/T131974) [22:11:34] 06Operations, 10MediaWiki-Categories, 07HHVM: Broken sorting and multi-page categories for Cyrillic wikis - https://phabricator.wikimedia.org/T136281#2332207 (10Joe) [22:13:08] (03CR) 10Ottomata: [C: 032] Apply druid roles in production with initial (guesswork) configuration [puppet] - 10https://gerrit.wikimedia.org/r/291113 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata) [22:14:56] 06Operations, 07HHVM, 13Patch-For-Review, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2332243 (10Joe) s7 has been completed at 21.50 - as expected, being the smallest sized shard. 
It went on at a decent speed of ~ 1.6 M records/hour, so we're... [22:15:42] 06Operations, 06Discovery, 10Maps, 10Tilerator, and 2 others: allow maps cluster Varnish cache purging - https://phabricator.wikimedia.org/T112836#2332246 (10Yurik) [22:15:44] 06Operations, 06Discovery, 10Maps, 10Tilerator, and 2 others: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776#2332245 (10Yurik) [22:16:47] (03PS1) 10Hashar: (DO NOT SUBMIT) chromium on hold, drop ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/291116 (https://phabricator.wikimedia.org/T136188) [22:17:20] (03CR) 10Hashar: [C: 04-1 V: 04-1] (DO NOT SUBMIT) chromium on hold, drop ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/291116 (https://phabricator.wikimedia.org/T136188) (owner: 10Hashar) [22:18:09] 06Operations, 06Discovery, 10Maps, 10Tilerator, and 2 others: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776#2332266 (10Yurik) [22:18:11] 06Operations, 06Discovery, 10Maps, 10Tilerator, and 2 others: allow maps cluster Varnish cache purging - https://phabricator.wikimedia.org/T112836#2332265 (10Yurik) 05Open>03Resolved [22:18:28] PROBLEM - puppet last run on druid1001 is CRITICAL: CRITICAL: puppet fail [22:19:12] (03PS1) 10Eevans: enable instance restbase2007-c [puppet] - 10https://gerrit.wikimedia.org/r/291117 (https://phabricator.wikimedia.org/T134016) [22:20:10] mutante: can you hook me up with a merge on https://gerrit.wikimedia.org/r/#/c/291117 ? [22:21:03] (03CR) 10Dzahn: [C: 032] "restbase2007-c.codfw.wmnet has address 10.192.16.178" [puppet] - 10https://gerrit.wikimedia.org/r/291117 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [22:22:15] urandom:yep, now it's active [22:22:22] mutante: thank you sir! 
[22:22:30] yw [22:24:04] (03PS1) 10Ottomata: Add union function from stdlib upstream [puppet] - 10https://gerrit.wikimedia.org/r/291119 [22:24:29] (03PS2) 10Ottomata: Add union function from stdlib upstream [puppet] - 10https://gerrit.wikimedia.org/r/291119 [22:24:39] (03CR) 10Ottomata: [C: 032 V: 032] Add union function from stdlib upstream [puppet] - 10https://gerrit.wikimedia.org/r/291119 (owner: 10Ottomata) [22:26:40] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 03Discovery-Search-Sprint, 07Elasticsearch: Restart elasticsearch clusters for Java update - https://phabricator.wikimedia.org/T135499#2332286 (10Deskana) 05Open>03Resolved p:05Triage>03Normal We did this! [22:28:39] (03PS1) 10Ottomata: Set empty properties hash in druid/coordinator.yaml [puppet] - 10https://gerrit.wikimedia.org/r/291121 [22:28:51] (03PS1) 10Dzahn: add "lint:ignore"s for several "puppet URL without modules" [puppet] - 10https://gerrit.wikimedia.org/r/291122 [22:30:14] !log Bootstrapping restbase2007-c.codfw.wmnet : T134016 [22:30:15] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [22:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:30:53] (03CR) 10Ottomata: [C: 032] Set empty properties hash in druid/coordinator.yaml [puppet] - 10https://gerrit.wikimedia.org/r/291121 (owner: 10Ottomata) [22:33:14] (03PS1) 10Greg Grossmeier: Remove Nik Everett's production access [puppet] - 10https://gerrit.wikimedia.org/r/291125 (https://phabricator.wikimedia.org/T130113) [22:33:40] from nagios docs "You can have Nagios notify you of problems and recoveries pretty much anyway you want: pager, cellphone, email, instant message, audio alert, electric shocker, etc. 
" [22:33:51] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for Pcoombe - https://phabricator.wikimedia.org/T136343#2332316 (10RobH) a:03Pcoombe As this is simply expanding Peter's access, he already has a shell account setup/live. Additionally, he has already signed the L3 document. @pcoombe:... [22:34:26] (03PS1) 10Ottomata: Use quotes in some druid yaml values [puppet] - 10https://gerrit.wikimedia.org/r/291126 [22:35:34] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for Pcoombe - https://phabricator.wikimedia.org/T136343#2331283 (10Ottomata) If you are looking for files just hosted on disk on stat1002, then you want `statistics-privatedata-users`. I think this is probably what you need. If this data is i... [22:36:11] 06Operations, 10Monitoring, 07Icinga: re-create script for manual paging - https://phabricator.wikimedia.org/T82937#2332342 (10Dzahn) https://old.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=134 [22:36:59] (03CR) 10Ottomata: [C: 032 V: 032] Use quotes in some druid yaml values [puppet] - 10https://gerrit.wikimedia.org/r/291126 (owner: 10Ottomata) [22:37:00] stupid gerrit mangling task urls [22:37:40] PROBLEM - cassandra-c CQL 10.192.16.178:9042 on restbase2007 is CRITICAL: Connection refused [22:37:50] (03CR) 10Alex Monk: "The email was about deployment access but this removes elasticsearch+logstash root access as well?" [puppet] - 10https://gerrit.wikimedia.org/r/291125 (https://phabricator.wikimedia.org/T130113) (owner: 10Greg Grossmeier) [22:39:34] (03PS1) 10Ottomata: Remove extraneous quote in druid/middlemanager.yaml [puppet] - 10https://gerrit.wikimedia.org/r/291127 [22:39:50] (03CR) 10Ottomata: [C: 032 V: 032] Remove extraneous quote in druid/middlemanager.yaml [puppet] - 10https://gerrit.wikimedia.org/r/291127 (owner: 10Ottomata) [22:41:51] (03CR) 10Greg Grossmeier: "Production deployment access includes elastic and logstash." 
[puppet] - 10https://gerrit.wikimedia.org/r/291125 (https://phabricator.wikimedia.org/T130113) (owner: 10Greg Grossmeier) [22:42:10] (03PS1) 10Ottomata: Add analytics_cluster::hadoop::client to druid workers so CDH is installed [puppet] - 10https://gerrit.wikimedia.org/r/291128 (https://phabricator.wikimedia.org/T131974) [22:42:45] (03PS2) 10Ottomata: Add analytics_cluster::hadoop::client to druid workers so CDH is installed [puppet] - 10https://gerrit.wikimedia.org/r/291128 (https://phabricator.wikimedia.org/T131974) [22:43:06] (03CR) 10Ottomata: [C: 032 V: 032] Add analytics_cluster::hadoop::client to druid workers so CDH is installed [puppet] - 10https://gerrit.wikimedia.org/r/291128 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata) [22:45:26] 06Operations, 06Discovery, 10Maps, 10Tilerator, 10Traffic: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776#2332371 (10Yurik) [22:47:54] (03PS1) 10Ottomata: Install the druid service package for each service [puppet] - 10https://gerrit.wikimedia.org/r/291129 (https://phabricator.wikimedia.org/T131974) [22:48:29] (03CR) 10Ottomata: [C: 032 V: 032] Install the druid service package for each service [puppet] - 10https://gerrit.wikimedia.org/r/291129 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata) [22:49:49] RECOVERY - puppet last run on druid1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [22:50:02] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.16.178:9042 on restbase2007 is CRITICAL: Connection refused eevans Node is bootstrapping. - The acknowledgement expires at: 2016-05-28 22:49:44. 
[22:52:20] (03PS8) 10Faidon Liambotis: raid: add a new "raid" fact [puppet] - 10https://gerrit.wikimedia.org/r/290988 (https://phabricator.wikimedia.org/T84050) [22:52:22] (03PS2) 10Faidon Liambotis: raid: add monitoring for HP controllers [puppet] - 10https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998) [22:52:24] (03PS2) 10Faidon Liambotis: raid: move check-raid.py into /usr/local/lib/nagios/plugins [puppet] - 10https://gerrit.wikimedia.org/r/291012 [22:52:26] (03PS2) 10Faidon Liambotis: raid: setup multiple checks, one per each RAID found [puppet] - 10https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) [22:52:28] (03PS2) 10Faidon Liambotis: raid: slightly change check-raid's "utility" names [puppet] - 10https://gerrit.wikimedia.org/r/291011 [22:52:31] (03PS3) 10Faidon Liambotis: raid: vary package installation on the RAID installed [puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) [22:53:10] (03Abandoned) 10Faidon Liambotis: raid: add HP's RAID tool to the list [puppet] - 10https://gerrit.wikimedia.org/r/290987 (https://phabricator.wikimedia.org/T97998) (owner: 10Faidon Liambotis) [22:53:16] (03PS1) 10Ottomata: s/etc/druid/middleManager/etc/druid/middlemanager/ in druid-middlemanager.dirs [debs/druid] - 10https://gerrit.wikimedia.org/r/291130 [22:53:51] (03PS2) 10Ottomata: s/etc/druid/middleManager/etc/druid/middlemanager/ in druid-middlemanager.dirs [debs/druid] - 10https://gerrit.wikimedia.org/r/291130 [22:53:56] (03CR) 10Faidon Liambotis: [C: 04-2] "Looks fine, but see https://gerrit.wikimedia.org/r/#/c/291014/ (and its ancestors) instead, or IOW, https://gerrit.wikimedia.org/r/#/q/top" [puppet] - 10https://gerrit.wikimedia.org/r/290717 (https://phabricator.wikimedia.org/T97998) (owner: 10Volans) [23:00:04] RoanKattouw ostriches Krenair MaxSem awight Dereckson: Dear anthropoid, the time has come. 
Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160526T2300). [23:00:04] dapatrick Krenair ejegg: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:04] (03CR) 10jenkins-bot: [V: 04-1] raid: slightly change check-raid's "utility" names [puppet] - 10https://gerrit.wikimedia.org/r/291011 (owner: 10Faidon Liambotis) [23:00:10] hey [23:00:41] dapatrick, want to do your deploy? [23:00:50] You should have the rights now [23:02:09] Krenair Uh, I'm not certain that I know how to do that. [23:02:19] (03PS3) 10Faidon Liambotis: raid: add monitoring for HP controllers [puppet] - 10https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998) [23:02:21] (03PS3) 10Faidon Liambotis: raid: move check-raid.py into /usr/local/lib/nagios/plugins [puppet] - 10https://gerrit.wikimedia.org/r/291012 [23:02:23] (03PS3) 10Faidon Liambotis: raid: setup multiple checks, one per each RAID found [puppet] - 10https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) [23:02:25] (03PS3) 10Faidon Liambotis: raid: slightly change check-raid's "utility" names [puppet] - 10https://gerrit.wikimedia.org/r/291011 [23:02:38] dapatrick, okay, I'll do it this time [23:02:43] csteipp Okay, thank you. [23:02:52] (03PS1) 10Ottomata: Add temporary debug message to puppet for union [puppet] - 10https://gerrit.wikimedia.org/r/291133 [23:03:23] I'm not csteipp [23:03:49] (03CR) 10Ottomata: [C: 032 V: 032] Add temporary debug message to puppet for union [puppet] - 10https://gerrit.wikimedia.org/r/291133 (owner: 10Ottomata) [23:03:58] Krenair Whoops. Sorry, I was just about to send a message to csteipp. [23:04:03] Krenair Thank you. 
:) [23:05:03] (03CR) 10jenkins-bot: [V: 04-1] raid: slightly change check-raid's "utility" names [puppet] - 10https://gerrit.wikimedia.org/r/291011 (owner: 10Faidon Liambotis) [23:05:47] PROBLEM - Druid middlemanager on druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server middlemanager [23:05:54] haha [23:05:55] alarms! [23:05:56] amazing! [23:05:57] shhh [23:05:58] 06Operations, 10cassandra: change graphite aggregation function for cassandra 'count' metrics - https://phabricator.wikimedia.org/T121789#1888407 (10GWicke) If this is really a gauge, should the cassandra metric reporter perhaps report it as such? [23:06:00] (03PS1) 10Ottomata: More temporary debug info [puppet] - 10https://gerrit.wikimedia.org/r/291134 [23:06:08] PROBLEM - Druid overlord on druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server overlord [23:07:12] (03CR) 10Ottomata: [C: 032 V: 032] More temporary debug info [puppet] - 10https://gerrit.wikimedia.org/r/291134 (owner: 10Ottomata) [23:08:20] (03PS1) 10Ottomata: Make debug notify unique [puppet] - 10https://gerrit.wikimedia.org/r/291135 [23:08:45] (03CR) 10Ottomata: [C: 032 V: 032] Make debug notify unique [puppet] - 10https://gerrit.wikimedia.org/r/291135 (owner: 10Ottomata) [23:08:57] PROBLEM - Druid broker on druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server broker [23:09:06] i acked that hm [23:09:17] PROBLEM - Druid coordinator on druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server coordinator [23:09:18] op, now i did [23:09:22] ACKNOWLEDGEMENT - Druid broker on druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server broker ottomata initial install [23:09:22] ACKNOWLEDGEMENT - Druid coordinator on druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
io.druid.cli.Main server coordinator ottomata initial install [23:09:22] ACKNOWLEDGEMENT - Druid historical on druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server historical ottomata initial install [23:09:22] ACKNOWLEDGEMENT - Druid middlemanager on druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server middlemanager ottomata initial install [23:09:22] ACKNOWLEDGEMENT - Druid overlord on druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server overlord ottomata initial install [23:10:13] (03CR) 10Faidon Liambotis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/291011 (owner: 10Faidon Liambotis) [23:10:52] unnghhh apparently ruby renders arrays as strings differently in labs than in prod [23:11:04] in puppet at least? [23:11:53] (03CR) 10jenkins-bot: [V: 04-1] raid: vary package installation on the RAID installed [puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [23:14:18] !log krenair@tin Synchronized php-1.28.0-wmf.3/extensions/OATHAuth/special/SpecialOATHEnable.php: https://gerrit.wikimedia.org/r/#/c/291007/ (duration: 00m 39s) [23:14:22] dapatrick, ^ [23:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:46] Krenair, Swell, thank again! [23:14:50] does it work? [23:15:09] Verifying now. [23:17:47] Krenair, Yep, it works. 
[23:18:16] (03PS1) 10Ottomata: Use ruby json lib to render Arrays as strings in druid runtime.properties.erb [puppet] - 10https://gerrit.wikimedia.org/r/291137 (https://phabricator.wikimedia.org/T131974)
[23:19:26] (03CR) 10jenkins-bot: [V: 04-1] raid: move check-raid.py into /usr/local/lib/nagios/plugins [puppet] - 10https://gerrit.wikimedia.org/r/291012 (owner: 10Faidon Liambotis)
[23:19:46] (03CR) 10jenkins-bot: [V: 04-1] raid: setup multiple checks, one per each RAID found [puppet] - 10https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis)
[23:20:37] Krenair You basically followed the steps at https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Case_1b:_extension.2Fskin.2Fvendor_changes, correct?
[23:21:15] (03CR) 10Ottomata: [C: 032] Use ruby json lib to render Arrays as strings in druid runtime.properties.erb [puppet] - 10https://gerrit.wikimedia.org/r/291137 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata)
[23:22:28] RECOVERY - Druid broker on druid1001 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server broker
[23:22:44] that's the page yep
[23:22:48] RECOVERY - Druid coordinator on druid1001 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server coordinator
[23:23:37] RECOVERY - Druid overlord on druid1001 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server overlord
[23:24:43] Krenair, Okay, I have not done that for extensions, but I've watched and taken notes when Chris was deploying to core, and yesterday I deployed a config change. It seems pretty similar.
[23:24:56] it's similar yes
[23:24:58] Krenair, but I'm guess in this case you were able to just update the submodule, correcdt?
but there are also important differences
[23:25:11] Gerrit updates the submodule for 99% of extensions
[23:25:26] well, you have to submodule update on tin still of course
[23:25:57] I mean because there were no "SECURITY:" patches in the log.
[23:26:19] We can't discuss that here.
[23:26:37] Got it.
[23:28:10] Krenair: sorry i'm late to the party. CentralNotice patch is not yet deployed, correct?
[23:28:21] correct
[23:28:27] PROBLEM - Druid broker on druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server broker
[23:28:36] cool, i'm available to test whenever that goes out
[23:28:37] I was waiting on jenkins and then got distracted and didn't notice it complete
[23:28:47] word, no rush
[23:32:44] (03PS1) 10BryanDavis: Add pep8 environment to tox.ini for jenkins job [puppet] - 10https://gerrit.wikimedia.org/r/291138
[23:33:44] Dereckson: Hi could you approve the translations at https://www.mediawiki.org/wiki/Template:WikimediaDownload please.
[23:33:46] !log krenair@tin Synchronized php-1.28.0-wmf.3/extensions/Math/modules/ve-math/ve.ui.MWMathContextItem.js: https://gerrit.wikimedia.org/r/#/c/290971/ (duration: 00m 28s)
[23:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:35:43] hm, didn't seem to take effecr
[23:35:45] effect*
[23:35:57] * Krenair blames RL caching
[23:36:01] !log krenair@tin Synchronized php-1.28.0-wmf.3/extensions/Math/modules/ve-math/ve.ui.MWMathContextItem.js: touch (duration: 00m 27s)
[23:36:06] paravoid: https://gerrit.wikimedia.org/r/#/c/291138/ should fix the pep8 jobs failures from misconfiguration. Now it lists the bazillion pep8 violations in the repo
[23:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:37:38] that worked
[23:39:07] paladox: ask translation admin rights?
[23:39:19] OK
[23:40:17] !log krenair@tin Synchronized php-1.28.0-wmf.3/extensions/VisualEditor/modules/ve-mw/dm/models/ve.dm.MWTransclusionModel.js: https://gerrit.wikimedia.org/r/#/c/290994/ (duration: 00m 25s)
[23:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:41:31] PROBLEM - puppet last run on mw2128 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:41:53] bah, same problem
[23:42:01] RECOVERY - Druid broker on druid1001 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server broker
[23:42:02] now it works
[23:42:14] ejegg, your turn
[23:42:35] ugh it's CentralNotice with the nonstandard deployment branches
[23:43:32] (03PS1) 10Ottomata: Druid puppet improvements for prod [puppet] - 10https://gerrit.wikimedia.org/r/291140 (https://phabricator.wikimedia.org/T131974)
[23:44:15] going through jenkins...
[23:44:25] CUSTOM - DPKG on planet2001 is OK: All packages OK
[23:44:44] that was a test
[23:44:44] (03CR) 10Ottomata: [C: 032 V: 032] Druid puppet improvements for prod [puppet] - 10https://gerrit.wikimedia.org/r/291140 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata)
[23:46:14] Krenair: yah, gotta have the same version everywhere
[23:47:11] RECOVERY - Druid middlemanager on druid1001 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server middleManager
[23:47:22] CUSTOM - DPKG on planet2001 is OK: All packages OK
[23:52:09] (03PS1) 10Dzahn: rcstream: let wikitech connect to redis via IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/291142
[23:53:03] (03PS2) 10Alex Monk: rcstream: let wikitech connect to redis via IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/291142 (https://phabricator.wikimedia.org/T136245) (owner: 10Dzahn)
[23:53:11] (03PS3) 10Alex Monk: rcstream: let wikitech connect to redis via IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/291142 (https://phabricator.wikimedia.org/T136245) (owner: 10Dzahn)
[23:53:20] (03CR) 10Alex Monk: [C:
031] rcstream: let wikitech connect to redis via IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/291142 (https://phabricator.wikimedia.org/T136245) (owner: 10Dzahn) [23:56:54] !log krenair@tin Synchronized php-1.28.0-wmf.3/extensions/CentralNotice/resources/subscribing: https://gerrit.wikimedia.org/r/#/c/291120/1 (duration: 00m 24s) [23:56:55] ejegg, ^ there you go, sorry for the wait [23:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:57:09] thanks Krenair , I'll take a look! [23:57:25] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/2937/rcs1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/291142 (https://phabricator.wikimedia.org/T136245) (owner: 10Dzahn) [23:59:17] I have a couple of my own patches to do [23:59:48] eh.. surprise Krenair.. [23:59:54] it doesnt do what we expected