[00:03:54] PROBLEM - puppet last run on mw1147 is CRITICAL: CRITICAL: Puppet has 1 failures [00:12:28] reedy@tin:/srv/mediawiki-staging/php-1.24wmf21$ mwscript extensions/WikimediaMaintenance/makeSizeDBLists.php --wiki=mediawikiwiki [00:12:28] DB connection error: Can't connect to MySQL server on '208.80.154.18' (4) (208.80.154.18) [00:12:29] damn wikitech [00:17:09] (03PS1) 10Reedy: Update size related dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160881 [00:17:46] heh, betawikiversity got smaller [00:18:21] (03CR) 10Reedy: [C: 032] Update size related dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160881 (owner: 10Reedy) [00:18:26] (03Merged) 10jenkins-bot: Update size related dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160881 (owner: 10Reedy) [00:18:47] !log reedy Synchronized database lists: (no message) (duration: 00m 15s) [00:18:53] Logged the message, Master [00:20:47] (03PS2) 10Reedy: Set $wgCategoryCollation to 'uca-hr' on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/147922 (https://bugzilla.wikimedia.org/67287) (owner: 10Bartosz Dziewoński) [00:20:50] (03PS2) 10Reedy: Set $wgCategoryCollation to 'xx-uca-et' on all Estonian-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154213 (https://bugzilla.wikimedia.org/54168) (owner: 10Bartosz Dziewoński) [00:20:54] (03PS2) 10Reedy: Set $wgCategoryCollation to 'uca-sk' on skwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/140580 (owner: 10Danny B.) 
[00:20:57] (03PS2) 10Reedy: Set $wgCategoryCollation to 'uca-fr' on frwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155241 (https://bugzilla.wikimedia.org/69782) (owner: 10Bartosz Dziewoński) [00:21:14] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [00:21:35] RECOVERY - Disk space on lanthanum is OK: DISK OK [00:24:23] (03CR) 10Reedy: [C: 032] Set $wgCategoryCollation to 'uca-sk' on skwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/140580 (owner: 10Danny B.) [00:24:35] (03Merged) 10jenkins-bot: Set $wgCategoryCollation to 'uca-sk' on skwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/140580 (owner: 10Danny B.) [00:25:01] :o [00:25:03] !log reedy Synchronized wmf-config/InitialiseSettings.php: skwiki collation (duration: 00m 15s) [00:25:10] Logged the message, Master [00:26:40] !log Running `mwscript updateCollation.php --wiki=skwiki --previous-collation=uppercase` in screen on tin [00:26:45] Logged the message, Master [00:28:26] looks to speed up... 
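The per-wiki step logged above (sync the config change, then run updateCollation.php against the affected wiki) can be sketched as a small command builder. This is only an illustration: the command line is copied from the !log entry, the wiki names come from the config changes in the log, and the helper names `update_collation_command` and `format_commands` are hypothetical. Nothing is executed; the sketch only prints what would be run in screen on tin.

```python
import shlex


def update_collation_command(wiki, previous_collation="uppercase"):
    """Build the argv list for one updateCollation.php run (hypothetical helper).

    The flags mirror the command shown in the !log entry; the command is
    only constructed here, never executed.
    """
    return [
        "mwscript",
        "updateCollation.php",
        f"--wiki={wiki}",
        f"--previous-collation={previous_collation}",
    ]


def format_commands(wikis):
    """Return one shell-ready command string per wiki, in the given order."""
    return [shlex.join(update_collation_command(wiki)) for wiki in wikis]


if __name__ == "__main__":
    # Wiki names taken from the collation changes in the log.
    for line in format_commands(["skwiki", "frwikiversity"]):
        print(line)
```

Building the argv as a list and joining it with `shlex.join` keeps the quoting correct if a wiki name or collation ever contains a character the shell treats specially.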
[00:29:43] 190000/744573 [00:30:23] whee [00:31:26] 260000 [00:39:58] 700000 [00:40:44] 15m30.899s [00:40:49] !log updateCollation on skwiki done [00:40:56] Logged the message, Master [00:41:07] (03PS3) 10Reedy: Set $wgCategoryCollation to 'uca-fr' on frwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155241 (https://bugzilla.wikimedia.org/69782) (owner: 10Bartosz Dziewoński) [00:41:12] (03CR) 10Reedy: [C: 032] Set $wgCategoryCollation to 'uca-fr' on frwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155241 (https://bugzilla.wikimedia.org/69782) (owner: 10Bartosz Dziewoński) [00:41:18] (03Merged) 10jenkins-bot: Set $wgCategoryCollation to 'uca-fr' on frwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155241 (https://bugzilla.wikimedia.org/69782) (owner: 10Bartosz Dziewoński) [00:42:26] !log reedy Synchronized wmf-config/InitialiseSettings.php: frwikiversity collation (duration: 00m 17s) [00:42:32] Logged the message, Master [00:42:53] !log running `mwscript updateCollation.php --wiki=frwikiversity --previous-collation=uppercase` in screen on tin [00:42:56] tiny wiki is tiny [00:42:59] Logged the message, Master [00:43:34] !log updateCollation on frwikiversity done [00:43:41] Logged the message, Master [00:44:26] (03CR) 10Reedy: [C: 04-1] "Needs rebasing. 
lol" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/147922 (https://bugzilla.wikimedia.org/67287) (owner: 10Bartosz Dziewoński) [00:44:46] (03PS3) 10Reedy: Set $wgCategoryCollation to 'xx-uca-et' on all Estonian-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154213 (https://bugzilla.wikimedia.org/54168) (owner: 10Bartosz Dziewoński) [00:44:50] (03CR) 10Reedy: [C: 032] Set $wgCategoryCollation to 'xx-uca-et' on all Estonian-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154213 (https://bugzilla.wikimedia.org/54168) (owner: 10Bartosz Dziewoński) [00:44:55] (03Merged) 10jenkins-bot: Set $wgCategoryCollation to 'xx-uca-et' on all Estonian-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154213 (https://bugzilla.wikimedia.org/54168) (owner: 10Bartosz Dziewoński) [00:45:26] !log reedy Synchronized wmf-config/InitialiseSettings.php: et collations (duration: 00m 15s) [00:45:33] Logged the message, Master [00:45:56] !log running `mwscript updateCollation.php --wiki=etwiki --previous-collation=uppercase` in screen on tin [00:46:02] Logged the message, Master [00:46:04] (03PS1) 10MZMcBride: Minor fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160890 [00:46:31] !log etwikibooks collation updated (280 rows) [00:46:37] Logged the message, Master [00:46:49] !log etwikimedia collation updated (121 rows) [00:46:54] Logged the message, Master [00:47:25] !log etwikiquote collation updated (706 rows) [00:47:31] Logged the message, Master [00:47:49] !log etwikisource collation updated (9918 rows) [00:47:55] Logged the message, Master [00:48:12] !log running `mwscript updateCollation.php --wiki=etwiktionary --previous-collation=uppercase` in screen on tin [00:48:18] Logged the message, Master [00:52:56] !log updateCollation on etwiktionary done [00:53:03] Logged the message, Master [00:53:07] !log updateCollation on etwiki done [00:53:13] Logged the message, Master [00:57:46] (03PS3) 10Reedy: Set 
$wgCategoryCollation to 'uca-hr' on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/147922 (https://bugzilla.wikimedia.org/67287) (owner: 10Bartosz Dziewoński) [00:58:03] (03CR) 10Reedy: [C: 032] Set $wgCategoryCollation to 'uca-hr' on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/147922 (https://bugzilla.wikimedia.org/67287) (owner: 10Bartosz Dziewoński) [00:58:08] (03Merged) 10jenkins-bot: Set $wgCategoryCollation to 'uca-hr' on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/147922 (https://bugzilla.wikimedia.org/67287) (owner: 10Bartosz Dziewoński) [00:58:42] !log reedy Synchronized wmf-config/InitialiseSettings.php: shwiki collation (duration: 00m 16s) [00:58:49] Logged the message, Master [00:59:16] !log running `mwscript updateCollation.php --wiki=shwiki --previous-collation=uppercase` in screen on tin [00:59:21] Logged the message, Master [01:15:34] !log updateCollation on shwiki done [01:15:42] Logged the message, Master [01:26:06] (03PS1) 10BBlack: fix ipv6 revdns for install2001 [dns] - 10https://gerrit.wikimedia.org/r/160895 [01:26:08] (03PS1) 10BBlack: add v6 dns for acamar+achernar [dns] - 10https://gerrit.wikimedia.org/r/160896 [01:26:20] (03CR) 10jenkins-bot: [V: 04-1] add v6 dns for acamar+achernar [dns] - 10https://gerrit.wikimedia.org/r/160896 (owner: 10BBlack) [01:26:27] (03CR) 10BBlack: [C: 032] fix ipv6 revdns for install2001 [dns] - 10https://gerrit.wikimedia.org/r/160895 (owner: 10BBlack) [01:27:15] (03PS1) 10Springle: depool s6 db1015 and s7 db1039 for codfw cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160897 [01:27:39] (03CR) 10Springle: [C: 032] depool s6 db1015 and s7 db1039 for codfw cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160897 (owner: 10Springle) [01:27:42] (03PS2) 10BBlack: add v6 dns for acamar+achernar [dns] - 10https://gerrit.wikimedia.org/r/160896 [01:27:45] (03Merged) 10jenkins-bot: depool s6 db1015 and s7 db1039 for codfw cloning [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/160897 (owner: 10Springle) [01:28:15] (03CR) 10BBlack: [C: 032] add v6 dns for acamar+achernar [dns] - 10https://gerrit.wikimedia.org/r/160896 (owner: 10BBlack) [01:29:02] !log springle Synchronized wmf-config/db-eqiad.php: depool s6 db1015 and s7 db1039 (duration: 00m 20s) [01:29:11] Logged the message, Master [01:33:14] !log xtrabackup clone db1015 to db2028 [01:33:22] Logged the message, Master [01:33:24] !log xtrabackup clone db1039 to db2029 [01:33:29] Logged the message, Master [01:35:09] PROBLEM - check_fundraising_jobs on db1025 is CRITICAL: CRITICAL missing_thank_yous=616 [critical =500]: recurring_gc_contribs_missed=0: recurring_gc_failures_missed=0: recurring_gc_jobs_required=959: recurring_gc_schedule_sanity=0 [01:40:11] RECOVERY - check_fundraising_jobs on db1025 is OK: OK missing_thank_yous=0: recurring_gc_contribs_missed=0: recurring_gc_failures_missed=0: recurring_gc_jobs_required=959: recurring_gc_schedule_sanity=0 [01:40:48] (03PS1) 10Springle: assign codfw slaves: x1 db2009, m1 db2010, m2 db2011, m3 db2012 [puppet] - 10https://gerrit.wikimedia.org/r/160898 [01:43:44] (03CR) 10Springle: [C: 032] assign codfw slaves: x1 db2009, m1 db2010, m2 db2011, m3 db2012 [puppet] - 10https://gerrit.wikimedia.org/r/160898 (owner: 10Springle) [01:49:12] PROBLEM - puppet last run on db2012 is CRITICAL: CRITICAL: Puppet has 3 failures [01:54:39] !log xtrabackup clone db1031 to db2009 [01:54:44] Logged the message, Master [02:00:25] !log xtrabackup clone db1016 to db2010 [02:00:31] Logged the message, Master [02:08:02] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Puppet has 1 failures [02:11:12] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3613 MB (3% inode=99%): [02:15:20] !log xtrabackup clone db1046 to db2011 [02:15:27] Logged the message, Master [02:17:53] RECOVERY - puppet last run on db2012 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [02:21:27] 
!log xtrabackup clone db1048 to db2012 [02:21:33] Logged the message, Master [02:26:32] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Puppet has 1 failures [02:28:22] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [02:43:02] !log LocalisationUpdate completed (1.24wmf20) at 2014-09-17 02:43:02+00:00 [02:43:09] Logged the message, Master [02:43:54] PROBLEM - puppet last run on mw1067 is CRITICAL: CRITICAL: Puppet has 1 failures [02:45:52] RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [02:58:39] (03CR) 10Chmarkine: [C: 031] tendril.wm.org - move behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/160823 (owner: 10Dzahn) [03:01:02] RECOVERY - Disk space on virt0 is OK: DISK OK [03:02:53] (03PS1) 10Springle: repool db1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160906 [03:03:12] RECOVERY - puppet last run on mw1067 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [03:03:44] (03CR) 10Springle: [C: 032] repool db1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160906 (owner: 10Springle) [03:03:48] (03Merged) 10jenkins-bot: repool db1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160906 (owner: 10Springle) [03:07:00] !log springle Synchronized wmf-config/db-eqiad.php: repool s6 db1015 (duration: 01m 41s) [03:07:06] Logged the message, Master [03:11:02] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: Puppet has 1 failures [03:17:38] !log LocalisationUpdate completed (1.24wmf21) at 2014-09-17 03:17:38+00:00 [03:17:44] Logged the message, Master [03:29:14] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [04:30:52] (03PS2) 10Tim Starling: Remove bits.wikimedia.org/robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154234 [04:31:27] (03CR) 10Tim Starling: [C: 
032] "PS2: rebase" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154234 (owner: 10Tim Starling) [04:31:31] (03Merged) 10jenkins-bot: Remove bits.wikimedia.org/robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154234 (owner: 10Tim Starling) [04:32:17] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Sep 17 04:32:17 UTC 2014 (duration 32m 16s) [04:32:24] Logged the message, Master [04:34:11] !log tstarling Synchronized docroot/bits: (no message) (duration: 00m 10s) [04:34:17] Logged the message, Master [04:46:13] RECOVERY - CI tmpfs disk space on lanthanum is OK: DISK OK [04:56:24] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Epic puppet fail [05:14:44] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [05:47:50] (03PS3) 10Giuseppe Lavagetto: icinga - use apache::site [puppet] - 10https://gerrit.wikimedia.org/r/160820 (owner: 10Dzahn) [05:49:04] (03CR) 10Giuseppe Lavagetto: [C: 032] icinga - use apache::site [puppet] - 10https://gerrit.wikimedia.org/r/160820 (owner: 10Dzahn) [06:05:59] akosiaris: ping [06:06:26] <_joe_> cajoel: pings at this time in the morning are ok only if coming with coffee [06:06:34] <_joe_> :) [06:06:38] it's 11pm [06:06:41] where I'm sitting [06:06:46] so I need a pillow [06:06:51] <_joe_> cajoel: 8 AM here [06:07:07] <_joe_> I'd use a pillow as well [06:07:14] maybe a snuggie [06:07:30] alex had a gerrit patch set for puppet + openldap [06:07:32] I can't find it [06:08:41] and he's too prolific in gerrit to make it easy [06:08:50] <_joe_> https://gerrit.wikimedia.org/r/#/c/156322/ ? 
[06:09:09] ding [06:09:10] thanks [06:09:21] apparently I'm no good at gerrit search [06:09:34] <_joe_> it's a skill you develop to survive :P [06:09:46] <_joe_> I usually use owner: status: project: [06:11:41] (03PS2) 10Giuseppe Lavagetto: mediawiki: make HAT appservers a separate cluster in ganglia [puppet] - 10https://gerrit.wikimedia.org/r/160624 [06:16:55] <_joe_> cajoel: btw, my thunderbird heuristic scam detection goes bananas with all the openvas related emails :P [06:17:08] heh [06:17:19] HTML in zip file!! oh no! [06:23:16] Q: when hacking local puppet apply, how do I specify where to pick up a template file? [06:24:01] <_joe_> cajoel: you can specify the puppetdir, templates will be searched in $puppetdir/templates, and in $modulepath/templates [06:24:18] <_joe_> for the specific CLI switches, use the man luke! (as I don't remember) [06:24:28] <_joe_> pretty sure the latter is --modulepath, but check it [06:25:22] (03PS3) 10Giuseppe Lavagetto: mediawiki: make HAT appservers a separate cluster in ganglia [puppet] - 10https://gerrit.wikimedia.org/r/160624 [06:25:43] (03CR) 10Giuseppe Lavagetto: [C: 031] "http://puppet-compiler.wmflabs.org//350/change/160624/html" [puppet] - 10https://gerrit.wikimedia.org/r/160624 (owner: 10Giuseppe Lavagetto) [06:26:27] (03PS1) 10Springle: repool db1039 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160918 [06:26:48] (03CR) 10Springle: [C: 032] repool db1039 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160918 (owner: 10Springle) [06:26:50] --templatedir [06:26:56] (03Merged) 10jenkins-bot: repool db1039 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160918 (owner: 10Springle) [06:26:58] thx [06:27:23] (03CR) 10Alexandros Kosiaris: "Curious as to why. nickel does not initiate connections to machines, does it ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/160802 (owner: 10Ottomata) [06:27:30] !log springle Synchronized wmf-config/db-eqiad.php: repool s7 db1039 (duration: 00m 08s) [06:27:35] Logged the message, Master [06:27:42] PROBLEM - puppetmaster backend https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:28:22] <_joe_> that's bogus [06:28:22] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: Epic puppet fail [06:28:24] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: Epic puppet fail [06:28:32] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Epic puppet fail [06:28:32] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: Epic puppet fail [06:28:33] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.797 second response time [06:28:36] <_joe_> or at least, it's ok no [06:28:44] PROBLEM - puppet last run on search1007 is CRITICAL: CRITICAL: Epic puppet fail [06:28:47] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 2 failures [06:28:54] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Epic puppet fail [06:28:57] <_joe_> oh it's mod_passenger time [06:29:06] PROBLEM - puppet last run on mw1126 is CRITICAL: CRITICAL: Epic puppet fail [06:29:07] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Epic puppet fail [06:29:21] (03CR) 10Jkrauska: "Ordering problem...?" 
[puppet] - 10https://gerrit.wikimedia.org/r/156322 (owner: 10Alexandros Kosiaris) [06:29:36] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:43] cajoel: gimme a sec, uploading a new way better change [06:29:59] * akosiaris_ 3rd day that my bouncer box is down :-( [06:30:06] akosiaris: found a minor ordering dependancy [06:30:17] got one too [06:30:27] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:27] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:37] dpkg won't install slapd when it finds a slapd.conf [06:30:40] ok [06:30:44] <_joe_> akosiaris_: I can lend you one [06:30:46] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:47] PROBLEM - puppet last run on dbproxy1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:47] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:06] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:06] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:06] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:06] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:16] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:16] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:17] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:17] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:26] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:26] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:26] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: Puppet has 1 failures 
[06:31:27] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:27] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:27] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:27] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:31] <_joe_> mh [06:31:37] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:17] PROBLEM - puppet last run on db1004 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:39] (03CR) 10Jkrauska: "require => Package['slapd']," [puppet] - 10https://gerrit.wikimedia.org/r/156322 (owner: 10Alexandros Kosiaris) [06:32:53] gotta go put kid back to sleep [06:33:04] akosiaris: email me with details? signing off [06:33:56] cajoel: ok. have a nice sleep [06:33:57] (03PS2) 10Alexandros Kosiaris: WIP: openldap module [puppet] - 10https://gerrit.wikimedia.org/r/156322 [06:34:05] cajoel: ^ there you go [06:34:21] might be back--depends on the kiddo [06:35:12] schema are'n't part of a package? [06:35:42] yeah but not gonna install samba just to get a schema file [06:36:13] <_joe_> definitely not [06:36:23] is there a simple (tar this gerrit up and downlaod it link?) 
[06:36:27] I've found per file downloads [06:36:29] crap [06:36:33] baby monitor wins [06:36:34] ttyl [06:36:53] signing out too, going to the gym [06:37:16] <_joe_> bye to both of you [06:45:06] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:45:47] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:45:48] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:45:48] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:46:06] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:46:07] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:46:16] RECOVERY - puppet last run on dbproxy1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:46:16] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:46:16] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:46:17] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:46:17] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:46:17] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:46:17] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:46:26] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures 
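The ordering problem discussed above (dpkg refusing to install slapd when a slapd.conf already exists, fixed with `require => Package['slapd']`) comes down to applying resources in dependency order. A minimal sketch of that idea, assuming a plain topological sort rather than Puppet's actual resolver: the resource titles `File[slapd.conf]` and `Service[slapd]` are hypothetical, and only `Package[slapd]` appears in the review comment.

```python
from graphlib import TopologicalSorter


def apply_order(requires):
    """Return resources in an order that satisfies every 'require' edge.

    `requires` maps a resource to the set of resources it requires,
    mimicking Puppet's `require =>` metaparameter (illustration only).
    """
    return list(TopologicalSorter(requires).static_order())


# Hypothetical resource graph: the config file requires the package,
# so the package is installed before slapd.conf is written.
requires = {
    "File[slapd.conf]": {"Package[slapd]"},
    "Service[slapd]": {"File[slapd.conf]"},
}

if __name__ == "__main__":
    for resource in apply_order(requires):
        print(resource)
```

Without the `File[slapd.conf]` edge the two resources are unordered relative to each other, which is how the config file can land on disk before the package and block the install.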
[06:46:27] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:46:27] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:46:36] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:46:46] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:46:47] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:46:47] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:46:47] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:46:48] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:46:49] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:46:50] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/ubuntu/project/trace/carbon.wikimedia.org is over 12 hours old. 
[06:47:06] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:47:07] RECOVERY - puppet last run on search1007 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:47:16] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:47:26] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:47:26] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:47:47] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:47:48] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/ubuntu/project/trace/carbon.wikimedia.org is over 0 hours old. [06:47:56] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:48:06] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [06:49:36] RECOVERY - puppet last run on db1004 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:51:36] PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: Puppet has 2 failures [06:51:47] (03PS1) 10Springle: depool s1 db1061 for codfw cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160921 [06:52:05] (03CR) 10Springle: [C: 032] depool s1 db1061 for codfw cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160921 (owner: 10Springle) [06:52:10] (03Merged) 10jenkins-bot: depool s1 db1061 for codfw cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160921 (owner: 10Springle) [06:52:46] !log springle Synchronized wmf-config/db-eqiad.php: depool s1 db1061 for codfw cloning (duration: 00m 07s) [06:52:52] Logged the message, Master [06:52:52] and 
we're back [06:53:10] and alex goes to the gym [06:55:35] !log xtrabackup clone db1061 to db2016 [06:55:41] Logged the message, Master [06:58:33] where are gerrit credentials pulled from? [06:59:21] seems I cannot login to gerrit... [07:08:17] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [07:50:38] (03CR) 10Filippo Giunchedi: [C: 031] tendril.wm.org - move behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/160823 (owner: 10Dzahn) [07:57:10] (03CR) 10Springle: [C: 031] tendril.wm.org - move behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/160823 (owner: 10Dzahn) [07:58:44] <_joe_> mmmh are you sure moving a monitoring host behind a layer of indirection is a good idea? [07:58:47] <_joe_> I don't [08:00:00] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "I think putting any monitoring service behind an indirection layer is a poor decision: availablity of monitoring should only be influenced" [puppet] - 10https://gerrit.wikimedia.org/r/160823 (owner: 10Dzahn) [08:08:32] (03PS2) 10Filippo Giunchedi: wikimedia.org: clarify labsconsole CNAME [dns] - 10https://gerrit.wikimedia.org/r/160454 [08:09:25] (03CR) 10Filippo Giunchedi: "no real fixes but cleanup, I don't have a strong opinion either way, what about the last PS?" 
[dns] - 10https://gerrit.wikimedia.org/r/160454 (owner: 10Filippo Giunchedi) [08:30:30] (03PS1) 10Giuseppe Lavagetto: puppet: introduce hiera for production [puppet] - 10https://gerrit.wikimedia.org/r/160924 [08:37:30] (03PS1) 10Filippo Giunchedi: metrics: point to misc-web-lb.eqiad [dns] - 10https://gerrit.wikimedia.org/r/160925 [08:43:02] (03PS1) 10Filippo Giunchedi: metrics: move from stat1001 to varnish [puppet] - 10https://gerrit.wikimedia.org/r/160926 [08:43:04] (03PS1) 10Filippo Giunchedi: metrics: disable SSL virtualhost and cert [puppet] - 10https://gerrit.wikimedia.org/r/160927 [08:45:22] (03Abandoned) 10Filippo Giunchedi: move metrics.wm.o and metrics-api.wm.o behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/160419 (owner: 10Filippo Giunchedi) [08:45:42] PROBLEM - puppet last run on fenari is CRITICAL: CRITICAL: Puppet has 1 failures [08:45:44] moaaaar misc-web ! [08:46:46] <_joe_> mmh I'm not a fan honestly [08:46:58] <_joe_> but metrics is a good fit (maybe) [08:52:48] yeah it is really a redirect [08:52:54] why not a fan btw _joe_ ? 
[08:53:38] <_joe_> godog: for monitoring, I feel the less moving parts, the better [08:53:45] <_joe_> but metrics is not really monitoring [08:53:59] <_joe_> so it's a good fit, maybe [08:54:23] <_joe_> (hence my -2 to moving tendril behind misc-web) [08:56:00] (03PS2) 10Alexandros Kosiaris: Removal of all snmptrap functionality [puppet] - 10https://gerrit.wikimedia.org/r/159286 [08:56:20] <_joe_> \o/ [08:56:25] yep for monitoring I tend to agree, even though not being able to access the interface doesn't impact functionality [08:56:37] (03PS9) 10Alexandros Kosiaris: Introducing Service Cluster A, hosting mathoid [puppet] - 10https://gerrit.wikimedia.org/r/156576 (https://bugzilla.wikimedia.org/69990) (owner: 10Physikerwelt) [08:56:55] <_joe_> +10 why is that not merged on smptrap [08:57:14] (03CR) 10Alexandros Kosiaris: [C: 032] Removal of all snmptrap functionality [puppet] - 10https://gerrit.wikimedia.org/r/159286 (owner: 10Alexandros Kosiaris) [08:57:18] (03CR) 10Giuseppe Lavagetto: "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/159286 (owner: 10Alexandros Kosiaris) [08:57:31] (03CR) 10Alexandros Kosiaris: [C: 032] Introducing Service Cluster A, hosting mathoid [puppet] - 10https://gerrit.wikimedia.org/r/156576 (https://bugzilla.wikimedia.org/69990) (owner: 10Physikerwelt) [09:01:01] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Epic puppet fail [09:01:38] (03PS1) 10Alexandros Kosiaris: mathoid ganglia cluster renamed to sca [puppet] - 10https://gerrit.wikimedia.org/r/160932 [09:01:52] PROBLEM - puppet last run on sca1002 is CRITICAL: CRITICAL: Epic puppet fail [09:02:52] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 5 failures [09:02:55] (03CR) 10Alexandros Kosiaris: [C: 032] mathoid ganglia cluster renamed to sca [puppet] - 10https://gerrit.wikimedia.org/r/160932 (owner: 10Alexandros Kosiaris) [09:04:48] <_joe_> sca1001? 
[09:04:51] <_joe_> this is new [09:05:11] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:06:29] _joe_: yeah. Service Cluster A [09:06:34] super obvious :D [09:07:17] so if we ever move a host to service cluster b, we change the hostname? [09:07:52] I would have named them svc#### but hey that is just me [09:07:58] <_joe_> "scab" [09:08:18] can't wait for cluster P [09:10:03] mmhh ldaplist -l group wmf on sanger yields "password incorrect" (both as my user and as root), known issue? [09:10:49] that is, without prompting for a password which makes sense running from root [09:13:22] PROBLEM - SSH on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:14:41] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:15:13] PROBLEM - nutcracker process on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:15:15] PROBLEM - check if dhclient is running on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:15:16] PROBLEM - Disk space on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:15:53] godog: sanger ? [09:15:59] why sanger ? [09:16:11] RECOVERY - nutcracker process on fenari is OK: PROCS OK: 1 process with UID = 116 (nutcracker), command name nutcracker [09:16:11] RECOVERY - check if dhclient is running on fenari is OK: PROCS OK: 0 processes with command name dhclient [09:16:11] RECOVERY - Disk space on fenari is OK: DISK OK [09:18:33] akosiaris: I was looking at https://wikitech.wikimedia.org/wiki/LDAP#LDAP_in_Production from https://wikitech.wikimedia.org/wiki/RT_Triage_Duty#LDAP_group_changes [09:19:33] RECOVERY - SSH on fenari is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [09:20:07] godog: sanger is the OIT LDAP mirror, not really meant to be modified [09:20:19] OIT LDAP mirror != labs LDAP [09:20:47] 2 different LDAPs... 
the OIT LDAP mirror is only used for cheap rcpt to checks [09:22:28] akosiaris: ah ok, that makes more sense so virt1000 or virt0 I guess [09:22:55] PROBLEM - check configured eth on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:23:45] RECOVERY - check configured eth on fenari is OK: NRPE: Unable to read output [09:24:54] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [09:27:19] (03CR) 10Springle: "Tendril is more about metrics and inventory than monitoring, but it's certainly a nebulous distinction." [puppet] - 10https://gerrit.wikimedia.org/r/160823 (owner: 10Dzahn) [09:27:55] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:27:56] _joe_: ^ not disaggreeing with you ;) maybe this sort of thing needs discussion in general [09:28:38] <_joe_> that was basically my point [09:28:44] PROBLEM - DPKG on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:29:45] RECOVERY - DPKG on fenari is OK: All packages OK [09:30:04] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [09:38:55] PROBLEM - DPKG on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:41:15] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:42:14] PROBLEM - check configured eth on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:43:55] PROBLEM - nutcracker port on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:44:52] <_joe_> sigh, fenari [09:44:54] RECOVERY - nutcracker port on fenari is OK: TCP OK - 0.000 second response time on port 11212 [09:44:54] RECOVERY - DPKG on fenari is OK: All packages OK [09:45:03] yeah, probably again on swap [09:45:08] I am logging in now [09:45:14] RECOVERY - check configured eth on fenari is OK: NRPE: Unable to read output [09:47:29] (PS1) Alexandros Kosiaris: mathoid: Remove duplicate resource and fix docs [puppet] - https://gerrit.wikimedia.org/r/160938 [09:47:54] PROBLEM - DPKG on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:59] (CR) Alexandros Kosiaris: [C: 2] mathoid: Remove duplicate resource and fix docs [puppet] - https://gerrit.wikimedia.org/r/160938 (owner: Alexandros Kosiaris) [09:49:14] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [09:49:54] RECOVERY - DPKG on fenari is OK: All packages OK [09:49:59] (PS1) Springle: repool db1061 [mediawiki-config] - https://gerrit.wikimedia.org/r/160939 [09:50:55] poor fenari [09:51:13] seems like puppet + whatever that apache was doing did not treat him well [09:51:32] (CR) Springle: [C: 2] repool db1061 [mediawiki-config] - https://gerrit.wikimedia.org/r/160939 (owner: Springle) [09:51:36] (Merged) jenkins-bot: repool db1061 [mediawiki-config] - https://gerrit.wikimedia.org/r/160939 (owner: Springle) [09:52:14] !log springle Synchronized wmf-config/db-eqiad.php: repool s1 db1061 (duration: 00m 08s) [09:52:21] Logged the message, Master [09:52:56] RECOVERY - puppet last run on fenari is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [09:52:59] !log stopped apache2 on fenari, it was leaking memory, puppet restarted it, need to kill this machine ASAP [09:53:05] Logged the message, Master [09:53:14] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4775 bytes in 0.103 second response time [09:55:24] I love it that as soon we actively started moving 
stuff off fenari it fell over, throwing its toys out of the pram [09:55:57] in my $DAYJOB-1 we had a very very old solaris box that finally we were about to decomission [09:56:24] so there is an RT about doing it Monday morning [09:56:43] the box had 0 services, it was all about a last check and shutting it down [09:57:04] and Sunday night it decided to die [09:57:15] the rest of the day was all about harakiri jokes ... [09:57:52] hehehe [09:58:33] (PS1) Alexandros Kosiaris: mathoid: create /srv/deployment [puppet] - https://gerrit.wikimedia.org/r/160940 [09:58:46] wonder if notpeter will want fenari too, to go with db9 [09:59:13] huh ? [09:59:30] did notpeter do anything with db9 ? can't remember [09:59:51] (CR) Alexandros Kosiaris: [C: 2] mathoid: create /srv/deployment [puppet] - https://gerrit.wikimedia.org/r/160940 (owner: Alexandros Kosiaris) [10:03:54] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [10:04:05] yey! [10:04:36] RECOVERY - puppet last run on sca1002 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [10:04:37] I think I am going to ulimit apache2 on fenari [10:08:56] whee.. all core dbs replicated to codfw. just external storage to go and pmtpa is yesterday's news [10:09:01] at least for db data [10:09:17] sweet [10:11:20] akosiaris: i recall some discussion, maybe jokingly, about shipping db9 to notpeter. sentimentality :) [10:19:11] godog: thx for the new jenkins package. Will brea^H^H^H^Hupgrade it this afternoon [10:20:59] hashar: haha no worries, simple enough :) [10:22:41] commuting back home for lunch [10:24:33] matanya: there was this about the broken puppet compiler job, fixed it seems? 
https://rt.wikimedia.org/Ticket/Display.html?id=8051 [10:25:11] godog: iirc _joe_fixed it [10:27:41] not sure what "broken" meant in the first place :) [10:37:55] <_joe_> godog: "not working" [10:40:06] _joe_: as in the jenkins job would fail entirely? [10:41:20] or it won't run at all for example? many shades (50?) of "not working" as usual [10:48:20] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:53:09] <_joe_> godog: yep [10:54:50] _joe_: which of the two options? :) [10:56:38] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:57:21] akosiaris: you broke netmon1001 badly [10:57:42] requiring the snmp package to be purged from all machines is perhaps not the best thing to do :) [10:57:52] hmmm [10:58:02] ok fixing [10:58:06] thank you :) [11:01:35] (PS3) Alexandros Kosiaris: Remove the snmptt user [puppet] - https://gerrit.wikimedia.org/r/143305 [11:03:58] (CR) Alexandros Kosiaris: [C: 2] Remove the snmptt user [puppet] - https://gerrit.wikimedia.org/r/143305 (owner: Alexandros Kosiaris) [11:24:24] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: Epic puppet fail [11:24:54] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:46] PROBLEM - puppet last run on db74 is CRITICAL: CRITICAL: Epic puppet fail [11:26:47] (PS1) Alexandros Kosiaris: librenms requires snmp [puppet] - https://gerrit.wikimedia.org/r/160944 [11:27:13] mark: fixed manually, seems like the only thing that directly called snmpget/snmpwalk et al. Which is ew, at least all other tools use libsnmp or some other binding. 
Puppet fix in https://gerrit.wikimedia.org/r/160944, but don't merge yet (I want the rest of the cluster to purge snmp before that) [11:27:59] ok [11:28:14] * akosiaris remembers the require_packages discussion the other day [11:29:12] (CR) Alexandros Kosiaris: [C: -1] "Wait for https://gerrit.wikimedia.org/r/#/c/143306/ to be merged" [puppet] - https://gerrit.wikimedia.org/r/160944 (owner: Alexandros Kosiaris) [11:37:40] <_joe_> akosiaris: which part? [11:38:29] the avoiding multiple definitions and ensure => absent vs ensure => present [11:38:37] <_joe_> eheh [11:38:47] * _joe_ whistles [11:43:22] (PS1) coren: Labs: Fix generation of meta_p.meta [software] - https://gerrit.wikimedia.org/r/160945 (https://bugzilla.wikimedia.org/54962) [11:43:34] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [11:44:05] RECOVERY - puppet last run on db74 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [11:44:54] (CR) coren: [C: 2 V: 2] "(Reflects current version)" [software] - https://gerrit.wikimedia.org/r/160945 (https://bugzilla.wikimedia.org/54962) (owner: coren) [11:58:27] (PS3) Alexandros Kosiaris: Remove the last resources of snmp on hosts [puppet] - https://gerrit.wikimedia.org/r/143306 [12:00:32] (CR) Alexandros Kosiaris: [C: 2] Remove the last resources of snmp on hosts [puppet] - https://gerrit.wikimedia.org/r/143306 (owner: Alexandros Kosiaris) [12:00:48] (CR) Alexandros Kosiaris: [C: 2] librenms requires snmp [puppet] - https://gerrit.wikimedia.org/r/160944 (owner: Alexandros Kosiaris) [12:20:17] !log upgrading jenkins 1.565.1 -> 1.565.2 [12:20:23] Logged the message, Master [12:20:55] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [12:22:27] (PS1) Alexandros Kosiaris: librenms: Provide a database purge script [puppet] - 
https://gerrit.wikimedia.org/r/160950 [12:35:34] (PS1) Alexandros Kosiaris: apache-graceful-all: drop dsh, use salt [puppet] - https://gerrit.wikimedia.org/r/160953 [12:39:17] (CR) Giuseppe Lavagetto: [C: -1] "LGTM, minor comments." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/160953 (owner: Alexandros Kosiaris) [12:44:45] (PS2) Alexandros Kosiaris: apache-graceful-all: drop dsh, use salt [puppet] - https://gerrit.wikimedia.org/r/160953 [12:46:18] (CR) Giuseppe Lavagetto: [C: 1] apache-graceful-all: drop dsh, use salt [puppet] - https://gerrit.wikimedia.org/r/160953 (owner: Alexandros Kosiaris) [12:46:22] (CR) Alexandros Kosiaris: apache-graceful-all: drop dsh, use salt (1 comment) [puppet] - https://gerrit.wikimedia.org/r/160953 (owner: Alexandros Kosiaris) [12:47:04] _joe_: 7 minutes to review comment addressing :-) [12:47:33] <_joe_> :) [12:48:01] (CR) Alexandros Kosiaris: "Here it is: https://gerrit.wikimedia.org/r/#/c/160953/" [puppet] - https://gerrit.wikimedia.org/r/160628 (owner: Matanya) [12:48:47] (CR) TTO: [C: -1] "Might not work. Need to look more carefully at it." [mediawiki-config] - https://gerrit.wikimedia.org/r/157338 (https://bugzilla.wikimedia.org/15583) (owner: TTO) [12:48:52] <_joe_> akosiaris: can salt be run from tin? [12:49:02] <_joe_> mmmm [12:49:10] <_joe_> now I remember what was my blocker [12:49:26] <_joe_> doing it this way will require to be roots [12:49:33] <_joe_> and doing this from the salt master [12:50:17] I think we can address both of this [12:50:19] these* [12:50:21] hmm, looks like I've to modify OpenStackManager to expose 'hosts in a project' info via an API to generate hosts config for shinken [12:50:38] <_joe_> YuviPanda: while you're at it... [12:50:46] <_joe_> labs-hiera in OSM!!! [12:50:50] <_joe_> :D [12:50:53] heh :D [12:51:00] but... but... PHP!!!!! [12:51:02] <_joe_> akosiaris: I think as well [12:51:07] <_joe_> YuviPanda: HHVM! 
[12:51:10] although, I hope we move to Horizon sooner than later [12:51:15] <_joe_> it's fancy new and "webscale" [12:51:18] _joe_: true but I've to write PHP! :'( [12:51:44] <_joe_> well, other devs get excited to write horrible JAVASCRIPT with CALLBACKS [12:51:53] MW is a bit like HHVM in that it powers a very popular internet website yet to understand most things you've to read the source.... [12:52:02] 6 of them in a row [12:52:27] <_joe_> so I figured that saying "w000t nonblocking webscale" would be enough to make devs love a terrible language [12:52:27] and if the 4th callbacks errs... well... tough luck [12:52:34] are we playing the HAT vs LAMP game? [12:52:40] <_joe_> akosiaris: but it's nonblocking [12:52:54] ON ERROR RESUME NEXT [12:53:06] <_joe_> YuviPanda: Web-cobol? [12:53:06] GOTO 10 ; GOTO 10 ; [12:53:28] <_joe_> I've seen some fortran library for writing CGI programs [12:53:44] i wrote cobol, now i feel old. thank you YuviPanda [12:53:57] <_joe_> akosiaris: but seriously, we can't expect apache-graceful-all to work on tin as-is [12:53:58] :D [12:54:00] heh [12:54:06] * YuviPanda wrote VB6 from waaay back [12:54:15] _joe_: we run some fortran in the stats machines [12:54:26] <_joe_> oh thanks for telling me [12:54:35] python-scipy package :) [12:54:38] lisp ftw :) [12:54:41] _joe_: I did not read that [12:54:44] we run ocaml in prod ;) [12:55:21] trolol [12:55:42] (PS2) Alexandros Kosiaris: librenms: Provide a database purge script [puppet] - https://gerrit.wikimedia.org/r/160950 [12:55:51] YuviPanda: do we run BF too ? [12:55:54] <_joe_> Reedy: Non-blocking! event-driven! webscale! [12:55:57] not yet! [12:56:06] ook [12:56:20] or should i say ook!? [12:56:26] (CR) jenkins-bot: [V: -1] librenms: Provide a database purge script [puppet] - https://gerrit.wikimedia.org/r/160950 (owner: Alexandros Kosiaris) [12:57:03] matanya: I like moo better [12:57:24] you should use sl then [12:57:54] pep8.... 
[12:57:59] I dislike pep8 [13:00:11] (PS3) Alexandros Kosiaris: librenms: Provide a database purge script [puppet] - https://gerrit.wikimedia.org/r/160950 [13:00:58] (CR) jenkins-bot: [V: -1] librenms: Provide a database purge script [puppet] - https://gerrit.wikimedia.org/r/160950 (owner: Alexandros Kosiaris) [13:01:45] <_joe_> I dislike badly-configured pep8 [13:02:43] I dislike any program that thinks it knows better than me what my code should look like ;) [13:05:10] <_joe_> mark: I usually use or.i as a code style formatter [13:05:17] <_joe_> it works incredibly well [13:05:31] <_joe_> a few minutes after I merge, usually [13:05:51] (PS4) Alexandros Kosiaris: librenms: Provide a database purge script [puppet] - https://gerrit.wikimedia.org/r/160950 [13:07:08] * matanya defends lint [13:07:21] (PS1) Giuseppe Lavagetto: hhvm: switch on stats collection. [puppet] - https://gerrit.wikimedia.org/r/160956 [13:07:57] <_joe_> matanya: we said we dislike it, not that we don't think it's useful ;) [13:08:30] _joe_: for me useful == like :D [13:08:59] <_joe_> matanya: not when annoy factor >>> usefulness [13:09:14] <_joe_> matanya: I get you like puppet then ;) [13:09:22] <_joe_> it is useful after all [13:09:42] i can't admit that, but yes :) [13:10:04] mark: are we going to have a mail server like pollonium in codfw ? [13:11:57] yes [13:12:01] of course [13:12:05] why do you even ask [13:13:25] * akosiaris hides in shame :( [13:13:30] :) [13:13:35] we're gonna have everything in codfw [13:13:36] almost [13:13:57] the only reason not to put something there yet if it's really user facing, because atm the network is still unstable [13:14:09] but i guess with ldap it should be ok [13:23:48] mark: Hej. I might annoy you by asking, but do you know what's the status of SNI on misc-web-lb (or who's on that if not you)? Update in https://rt.wikimedia.org/Ticket/Display.html?id=8345 would be welcome... 
tia [13:24:43] andre__: I will look into that this week yes, hopefully we can make it work [13:24:53] it's unchanged since friday (that I know of) [13:25:11] (PS2) Giuseppe Lavagetto: hhvm: switch on stats collection. [puppet] - https://gerrit.wikimedia.org/r/160956 [13:27:00] (CR) Giuseppe Lavagetto: [C: 2] hhvm: switch on stats collection. [puppet] - https://gerrit.wikimedia.org/r/160956 (owner: Giuseppe Lavagetto) [13:28:10] mark, thanks [13:29:17] !log upgrading Zuul to 2.0.0.286.gb1811ab [13:29:23] Logged the message, Master [13:33:15] !log stopping zuul for upgrade [13:33:22] Logged the message, Master [13:34:42] wow, the trusty package for shinken itself seems broken?! [13:37:13] yup [13:37:25] service won't start because they seem to have messed up paths [13:37:28] * YuviPanda facepalms [13:38:07] !log Zuul upgraded successfully apparently. [13:38:14] Logged the message, Master [13:38:43] hashar: such confidence [13:39:08] if it is totally broken I can just pretend I had a 3rd child and vanish! [13:49:37] akosiaris: maybe I'm confused, I thougth gmetad connected to the aggregators... [13:50:22] ottomata: hmm I think you are right, let me check [13:51:28] checking too...:) [13:51:31] ottomata: yeah my bad [13:51:49] no worries, i didn't double check that yesterday, just assumed [13:51:51] gmetad connects to other gmetads and the gmond aggregators [13:51:53] you coulda been right! [13:52:33] I wasn't actually thinking about the gmetad when commenting [13:52:40] but gmond [13:52:45] but yeah, ma bad [13:55:15] (Abandoned) BBlack: move baham to private subnet [dns] - https://gerrit.wikimedia.org/r/160852 (owner: BBlack) [13:55:27] (Abandoned) BBlack: move baham to private subnet [puppet] - https://gerrit.wikimedia.org/r/160854 (owner: BBlack) [13:55:56] (CR) Ottomata: "Is there a good reason to call the directory 'hieradata/' rather than just 'heira/'?" 
[puppet] - https://gerrit.wikimedia.org/r/160924 (owner: Giuseppe Lavagetto) [13:57:01] PROBLEM - puppet last run on cp1043 is CRITICAL: CRITICAL: Puppet has 1 failures [14:00:01] RECOVERY - puppet last run on cp1043 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [14:01:33] (PS1) Mark Bergsma: Only set the ipv6only listen option on the default server [puppet] - https://gerrit.wikimedia.org/r/160959 [14:04:06] (PS1) Reedy: WIP: Extract wmf-beta-scap to runas-withagent wrapper script [puppet] - https://gerrit.wikimedia.org/r/160960 [14:05:25] (CR) Reedy: [C: -1] "Where did we say the base script should end up? Deployment or something?" [puppet] - https://gerrit.wikimedia.org/r/160960 (owner: Reedy) [14:05:35] (PS1) BBlack: turn autoconf back on for v6_mapped ifaces [puppet] - https://gerrit.wikimedia.org/r/160961 [14:06:12] (CR) BBlack: [C: 2] turn autoconf back on for v6_mapped ifaces [puppet] - https://gerrit.wikimedia.org/r/160961 (owner: BBlack) [14:11:32] PROBLEM - puppet last run on magnesium is CRITICAL: CRITICAL: Puppet has 1 failures [14:14:17] magnesium is me [14:14:54] ACKNOWLEDGEMENT - puppet last run on magnesium is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi fixing racktables ssl [14:15:36] (PS1) Filippo Giunchedi: racktables: clean up ssl config, not needed [puppet] - https://gerrit.wikimedia.org/r/160963 [14:16:16] * YuviPanda makes godog into a ribbon, lights him up and enjoys the brilliant colors [14:17:23] haha I wasn't aware being made out of fireworks powder [14:17:47] Doesn't magnesium burn white? ;) [14:18:39] Reedy: white is a brilliant color and it produces other brilliant colors when it falls on less brilliantly colored things [14:19:03] YuviPanda: you have just been volunteered for https://gerrit.wikimedia.org/r/#/c/160963/ ! [14:19:31] we had racktables in labs?! [14:20:14] (CR) Ottomata: "Alex and I just chatted on IRC. 
Answer: Because gmetad polls the gangali aggregators." [puppet] - https://gerrit.wikimedia.org/r/160802 (owner: Ottomata) [14:20:46] YuviPanda: indeed! [14:20:50] (CR) Yuvipanda: [C: 1] "Looks ok to me. There's also a pem file in puppet we might want to remove?" [puppet] - https://gerrit.wikimedia.org/r/160963 (owner: Filippo Giunchedi) [14:20:56] godog: I don't have +2 :) [14:21:25] (PS2) Andrew Bogott: Tell palladium about all the domains [puppet] - https://gerrit.wikimedia.org/r/160848 [14:22:17] (CR) Ottomata: [C: 2] "Cool, you'll have to remove the cert manually, right?" [puppet] - https://gerrit.wikimedia.org/r/160927 (owner: Filippo Giunchedi) [14:22:46] (CR) Ottomata: [C: 1] metrics: move from stat1001 to varnish [puppet] - https://gerrit.wikimedia.org/r/160926 (owner: Filippo Giunchedi) [14:22:49] (PS1) Yuvipanda: labmon: Make tools CPU check more accomodating [puppet] - https://gerrit.wikimedia.org/r/160964 [14:22:57] godog: I volunteer you for ^ :) [14:22:58] trivial [14:23:22] PROBLEM - puppet last run on fenari is CRITICAL: CRITICAL: Puppet has 1 failures [14:24:50] (CR) Ottomata: [C: 1] "Hmm, you know, I do'nt think we need metrics-api at all anymore. Let's remove it." [dns] - https://gerrit.wikimedia.org/r/160925 (owner: Filippo Giunchedi) [14:25:09] (CR) Ottomata: "Actually, let's remove metrics-api. I just asked Dan if we needed it, and he doesn't think we do." [puppet] - https://gerrit.wikimedia.org/r/160926 (owner: Filippo Giunchedi) [14:27:37] (PS2) Mark Bergsma: Only set the ipv6only listen option on the default server [puppet] - https://gerrit.wikimedia.org/r/160959 [14:28:04] (CR) coren: [C: 2] "It's better than status quo." [puppet] - https://gerrit.wikimedia.org/r/160964 (owner: Yuvipanda) [14:28:21] Coren: can you also force a run on neon? [14:29:18] YuviPanda: oh, bad timing? 
I see it is merged already :) [14:29:33] godog: yeah :) I'll volunteer you for something else later on, I suppose :) [14:29:49] * YuviPanda will have to volunteer people for a lot of things for the next 6 weeks, after which I'll presumably have +2 myself [14:29:53] haha sounds good, RT anyways this week YuviPanda [14:30:00] :D ok [14:30:18] godog: I got James_F|Away to file a bunch of RT requests to add people to the wmf ldap group [14:31:53] (PS2) Filippo Giunchedi: racktables: clean up ssl config, not needed [puppet] - https://gerrit.wikimedia.org/r/160963 [14:32:16] YuviPanda: ye saw that, some didn't exist and some where already in the group, I've added the rest [14:32:23] ah cool [14:32:42] godog: can you add bdziewonski@wikimedia.org as well? [14:32:47] not sure if there was a request for him [14:34:18] YuviPanda: best to comment on the RT so we don't lose track I think, RT 8375 [14:34:22] will do [14:35:06] godog: done [14:37:11] Reedy: Have time for yet another wikitech bug? https://bugzilla.wikimedia.org/show_bug.cgi?id=70641 <- wikitech thinks it is 'wikipedia' for interwiki links. [14:37:52] RECOVERY - puppet last run on fenari is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [14:38:12] YuviPanda: {{done}} [14:38:19] Coren: cool, ty [14:40:20] (CR) Filippo Giunchedi: [C: 2 V: 2] racktables: clean up ssl config, not needed [puppet] - https://gerrit.wikimedia.org/r/160963 (owner: Filippo Giunchedi) [14:41:32] andrewbogott: I wonder if labswiki should be in special.dblist, and we need to rebuild the interwiki cache [14:41:52] RECOVERY - puppet last run on magnesium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [14:43:09] (PS1) Andrew Bogott: Add labswiki to special.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/160966 [14:43:18] Reedy: is that what you mean? 
[14:43:54] Yeah [14:44:01] I'm just looking through the dumpInterwiki code [14:44:35] (CR) Reedy: [C: 2] Add labswiki to special.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/160966 (owner: Andrew Bogott) [14:44:39] (Merged) jenkins-bot: Add labswiki to special.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/160966 (owner: Andrew Bogott) [14:45:02] * aude makes labswiki at wikidata client! :D [14:45:07] !log restarted apache2 on magnesium, validate removal of ssl certs [14:45:13] Logged the message, Master [14:45:20] then add it to the sites table :) [14:45:32] Instance lists on WikiData! [14:45:37] So I don't have to write PHP!!!1 [14:45:38] :) [14:45:38] heh [14:45:41] probably not :( [14:45:46] (PS3) Mark Bergsma: Only set the ipv6only listen option on the default server [puppet] - https://gerrit.wikimedia.org/r/160959 [14:46:12] of course, dumpInterwiki is just a bit broken [14:47:58] (PS1) Reedy: Update IW cache [mediawiki-config] - https://gerrit.wikimedia.org/r/160967 [14:48:11] (CR) Reedy: [C: 2] Update IW cache [mediawiki-config] - https://gerrit.wikimedia.org/r/160967 (owner: Reedy) [14:48:16] (Merged) jenkins-bot: Update IW cache [mediawiki-config] - https://gerrit.wikimedia.org/r/160967 (owner: Reedy) [14:48:31] andrewbogott: try sync-common... [14:48:42] !log reedy Synchronized wmf-config/interwiki.cdb: (no message) (duration: 00m 16s) [14:48:43] Reedy: ok! 
[14:48:48] Logged the message, Master [14:49:28] (PS4) Mark Bergsma: Only set the ipv6only listen option on the default server [puppet] - https://gerrit.wikimedia.org/r/160959 [14:50:14] (CR) Ottomata: [C: 1] Only set the ipv6only listen option on the default server [puppet] - https://gerrit.wikimedia.org/r/160959 (owner: Mark Bergsma) [14:50:22] :) [14:54:26] !log db1062 out of action for bug hunt https://mariadb.atlassian.net/browse/MDEV-6751 [14:54:35] Logged the message, Master [14:54:36] Reedy: just curious, why does sync-common always display a finished message ("Finished rsync common (duration: 03m 14s)") and then take ~5 minutes more to actually return? What's it doing? [14:54:45] Who's swatting today? [14:55:13] Oh, no patches, neato [14:55:53] Reedy: also, same issue is still present :( https://wikitech.wikimedia.org/w/index.php?title=Wikipedia:Dallas/Fort_Worth_International_Airport&action=edit&redlink=1 [14:57:56] (PS1) Giuseppe Lavagetto: Updating changelog, modified the name of the package to python-diamond. [debs/python-diamond] - https://gerrit.wikimedia.org/r/160969 [14:59:08] https://wikitech.wikimedia.org/wiki/Special:Interwiki [14:59:15] (PS2) Giuseppe Lavagetto: Updating changelog, modified the name of the package to python-diamond. [debs/python-diamond] - https://gerrit.wikimedia.org/r/160969 [14:59:17] The map puts wikipedia //en.wikipedia.org/wiki/$1 [14:59:56] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Updating changelog, modified the name of the package to python-diamond. [debs/python-diamond] - https://gerrit.wikimedia.org/r/160969 (owner: Giuseppe Lavagetto) [15:00:04] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140917T1500). Please do the needful. [15:00:18] Still no patches! [15:00:23] I declare SWAT a great success. 
[15:00:27] * anomie sees nothing to SWAT this morning [15:00:29] Cookies and punch at my house [15:01:38] Reedy: the // makes it a relative link? [15:01:49] protocol relative, yeah [15:02:12] well… that's correct, isn't it? [15:02:18] yup [15:02:37] The extension is just displaying the contents of the table/cdb [15:02:44] MediaWiki still can do stupid stuff [15:03:11] (PS7) Yuvipanda: [WIP] Initial shinken setup for labs [puppet] - https://gerrit.wikimedia.org/r/160626 [15:05:03] the parser (presumably) isn't recognising it as a interwiki link [15:06:06] (PS8) Yuvipanda: [WIP] Initial shinken setup for labs [puppet] - https://gerrit.wikimedia.org/r/160626 [15:06:52] Reedy: I'm about to abandon you in favor of breakfast, sorry -- back in 30 or so. [15:12:35] eval.php on tin knows it's an interwiki too [15:13:00] (CR) BryanDavis: "> Where did we say the base script should end up? Deployment or something?" [puppet] - https://gerrit.wikimedia.org/r/160960 (owner: Reedy) [15:16:22] (PS9) Yuvipanda: [WIP] Initial shinken setup for labs [puppet] - https://gerrit.wikimedia.org/r/160626 [15:16:54] (PS1) Ottomata: Add manage_mountat parameter to labs_lvm::volume [puppet] - https://gerrit.wikimedia.org/r/160970 [15:17:01] reedy@tin:/srv/mediawiki-staging/multiversion$ mwscript eval.php mediawikiwiki [15:17:01] > var_dump( Title::newFromText( 'wikipedia:Dallas/Fort Worth International Airport' )->isLocal() ); [15:17:01] int(1) [15:17:06] Why is it returning 1? 
[15:17:07] lol [15:17:48] (CR) Filippo Giunchedi: "yep that's correct, from the host and in private.git" [puppet] - https://gerrit.wikimedia.org/r/160927 (owner: Filippo Giunchedi) [15:18:14] reedy@tin:/srv/mediawiki-staging/multiversion$ mwscript eval.php mediawikiwiki [15:18:14] > var_dump( Title::newFromText( 'wikipedia:Dallas/Fort Worth International Airport' )->getInterwiki() ); [15:18:14] string(9) "wikipedia" [15:18:14] > reedy@tin:/srv/mediawiki-staging/multiversion$ mwscript eval.php labswiki [15:18:14] > var_dump( Title::newFromText( 'wikipedia:Dallas/Fort Worth International Airport' )->getInterwiki() ); [15:18:15] string(0) "" [15:20:09] andrewbogott_afk: After the rsync step, sync-common rebuilds the l10n cdb files from the json data that the sync brought over. I apparently didn't add any logging when I added that bit. [15:20:21] (PS2) Filippo Giunchedi: metrics: disable SSL virtualhost and cert [puppet] - https://gerrit.wikimedia.org/r/160927 [15:20:23] (PS2) Filippo Giunchedi: metrics: move from stat1001 to varnish [puppet] - https://gerrit.wikimedia.org/r/160926 [15:20:25] (PS2) Filippo Giunchedi: metrics: point to misc-web-lb.eqiad [dns] - https://gerrit.wikimedia.org/r/160925 [15:20:45] bd808: useless! [15:20:46] :P [15:20:58] * bd808 hangs his head in shame [15:22:17] heh. I even explicitly muted the output that would normally come from scap-rebuild-cdbs [15:22:58] It would look dumb if it wasn't muted though [15:23:33] the outer logger would write each progress meter update as a new line. [15:24:23] If that stuff didn't have to run as a different user that code would be less messy and the logging more awesome [15:30:53] andrewbogott_afk: Found it [15:32:02] Reedy: great! 
Also, I'm back :) [15:32:39] (PS1) Reedy: Set wgMetaNamespace for labswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/160973 [15:32:48] hmm [15:32:52] (PS2) Reedy: Set wgMetaNamespace for labswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/160973 (https://bugzilla.wikimedia.org/70641) [15:32:54] I need to write something similar to naggen2 for shinken [15:32:58] but probably not for the first pass [15:33:03] (CR) Reedy: [C: 2] Set wgMetaNamespace for labswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/160973 (https://bugzilla.wikimedia.org/70641) (owner: Reedy) [15:33:04] which will just do http checks [15:33:08] (Merged) jenkins-bot: Set wgMetaNamespace for labswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/160973 (https://bugzilla.wikimedia.org/70641) (owner: Reedy) [15:33:16] andrewbogott: sync-common :) [15:34:02] !log reedy Synchronized wmf-config/InitialiseSettings.php: Set wgMetaNamespace for labswiki (duration: 00m 14s) [15:34:08] Logged the message, Master [15:35:33] Reedy: blue links! [15:35:52] Jeff_Green: you reckon the admin module would be enough for personal accounts needs in https://rt.wikimedia.org/Ticket/Display.html?id=4337 ? [15:36:06] andrewbogott: at worst, a few pages may need purging/null editing [15:36:30] (CR) Physikerwelt: [C: 1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/160412 (owner: Alexandros Kosiaris) [15:36:46] Reedy: thank you! [15:37:17] (PS2) Alexandros Kosiaris: Assign LVS IP address to mathoid [puppet] - https://gerrit.wikimedia.org/r/160412 [15:37:21] I guess we should update wikitech-static somewhat at some point [15:37:27] Seeing as it's still 1.23 [15:37:33] (PS1) Manybubbles: Don't double up commonswiki's file shards [mediawiki-config] - https://gerrit.wikimedia.org/r/160974 [15:37:56] how do I retitle a ticket in RT? 
[15:38:36] YuviPanda: Basics [15:38:41] https://rt.wikimedia.org/Ticket/Modify.html?id= [15:38:51] ty Reedy [15:39:27] Reedy, andrewbogott: You know what would be cool? Making a page on wikitech or elsewhere that tracks all the little tweaks that were needed to add wikitech to the config. It might come in handy again someday. [15:39:44] bd808: megadiff! [15:40:03] Or 'grep labswiki' [15:40:24] heh [15:40:47] I was thinking more of a textbook than an invitation to an archeological dig [15:40:49] (PS2) Ottomata: Put wikimetrics $var_directory in /srv. [puppet] - https://gerrit.wikimedia.org/r/160689 [15:41:10] (PS3) Ottomata: Put wikimetrics $var_directory in /srv. [puppet] - https://gerrit.wikimedia.org/r/160689 [15:41:17] archeology textbook? [15:41:24] lol [15:41:28] !log manually forcing Cirrus's commonswiki's file index apart from one another in an attempt to lower the consistently high load on elastic1013 [15:41:34] Logged the message, Master [15:42:08] !log gerrit change to lock that into place is https://gerrit.wikimedia.org/r/#/c/160974/ and I'll deploy it in my window in 15 minutes. [15:42:13] Logged the message, Master [15:43:33] (PS2) Manybubbles: Don't double up commonswiki's file shards [mediawiki-config] - https://gerrit.wikimedia.org/r/160974 [15:44:19] (PS1) Manybubbles: Ruwiki gets Cirrus as primary [mediawiki-config] - https://gerrit.wikimedia.org/r/160975 [15:44:26] Hi akosiaris, is there anything more to do about https://gerrit.wikimedia.org/r/#/c/160412 [15:46:08] (CR) Alexandros Kosiaris: [C: 2] Assign LVS IP address to mathoid [puppet] - https://gerrit.wikimedia.org/r/160412 (owner: Alexandros Kosiaris) [15:46:47] (PS4) Ottomata: Put wikimetrics $var_directory in /srv. 
[puppet] - https://gerrit.wikimedia.org/r/160689 [15:47:30] (Abandoned) Ottomata: Add manage_mountat parameter to labs_lvm::volume [puppet] - https://gerrit.wikimedia.org/r/160970 (owner: Ottomata) [15:48:15] (PS6) Ottomata: Move make-instance-vol file into labs_lvm base class [puppet] - https://gerrit.wikimedia.org/r/160687 [15:48:31] hmmm, apparently I missed a memo re. RT [15:48:47] (PS1) Cmjohnson: Removing public ip's for aluminium [dns] - https://gerrit.wikimedia.org/r/160977 [15:49:10] (PS5) Ottomata: Put wikimetrics $var_directory in /srv. [puppet] - https://gerrit.wikimedia.org/r/160689 [15:49:16] bd808: for your editing pleasure: https://wikitech.wikimedia.org/wiki/Wikitech [15:49:33] Jeff_Green: my experience is similar, really hard to track RT stuff via email [15:49:48] andrewbogott: Ha. Fair enough [15:49:51] godog: I mean, I can no longer log in [15:50:05] (CR) Cmjohnson: [C: 2] Removing public ip's for aluminium [dns] - https://gerrit.wikimedia.org/r/160977 (owner: Cmjohnson) [15:50:29] oh there we go. [15:50:47] (CR) Ottomata: [C: 2 V: 2] Move make-instance-vol file into labs_lvm base class [puppet] - https://gerrit.wikimedia.org/r/160687 (owner: Ottomata) [15:51:12] PROBLEM - Parsoid on wtp1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:51:13] * bd808 wants betafeatures on wikitech so he can turn on compact personal bar [15:52:02] PROBLEM - Parsoid on wtp1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:52:20] (PS6) Ottomata: Put wikimetrics $var_directory in /srv. [puppet] - https://gerrit.wikimedia.org/r/160689 [15:52:53] hmmm parsoid hosts are in full CPU [15:53:12] PROBLEM - Parsoid on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:53:55] (PS7) Ottomata: Put wikimetrics $var_directory in /srv. [puppet] - https://gerrit.wikimedia.org/r/160689 [15:54:02] (PS8) Ottomata: Put wikimetrics $var_directory in /srv. 
[puppet] - https://gerrit.wikimedia.org/r/160689 [15:54:12] PROBLEM - Parsoid on wtp1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:54:16] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Parsoid%2520eqiad&tab=m&vn=&hide-hf=false [15:54:42] PROBLEM - Parsoid on wtp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:54:52] PROBLEM - Parsoid on wtp1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:55:04] godog: the admin module sort of fixed some of those issues [15:55:23] the only thing there that it totally fixed was adding individual accounts to servers [15:55:32] PROBLEM - Parsoid on wtp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:56:02] PROBLEM - Parsoid on wtp1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:56:08] PROBLEM - Parsoid on wtp1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:56:11] PROBLEM - Parsoid on wtp1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:56:31] PROBLEM - Parsoid on wtp1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:56:41] PROBLEM - Parsoid on wtp1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:57:11] PROBLEM - Parsoid on wtp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:57:15] (PS3) Manybubbles: Don't double up shards for some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/160974 [15:57:32] PROBLEM - Parsoid on wtp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:57:32] PROBLEM - Parsoid on wtp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:57:39] !log manually pushed apart ruwiki and nlwiki's shards as well - might help - updated commit to reflect that [15:57:41] bd808: Not too hard to add it… [15:57:41] PROBLEM - Parsoid on wtp1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:57:45] Logged the message, Master [15:57:51] PROBLEM - Parsoid on wtp1011 is CRITICAL: CRITICAL - Socket timeout after 
10 seconds [15:57:58] (03CR) 10Ottomata: [C: 032 V: 032] Put wikimetrics $var_directory in /srv. [puppet] - 10https://gerrit.wikimedia.org/r/160689 (owner: 10Ottomata) [15:58:02] PROBLEM - Parsoid on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:58:16] subbu, gwicke: ^^^ Re. Parsoid cluster woes. [15:58:44] Jeff_Green: ack [15:58:51] PROBLEM - Parsoid on wtp1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:59:02] PROBLEM - Parsoid on wtp1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:59:06] PROBLEM - Parsoid on wtp1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:59:07] grr ... [15:59:10] godog: i commented on the ticket [15:59:52] PROBLEM - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:59:54] i need help investigating this .. logging onto tin. [16:00:04] manybubbles, ^d: Respected human, time to deploy Search (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140917T1600). Please do the needful. [16:00:12] subbu: look at http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Parsoid%2520eqiad&tab=m&vn=&hide-hf=false CPU/MEM usage started climbing at 15:3X [16:00:13] (03CR) 10Manybubbles: [C: 032] Don't double up shards for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160974 (owner: 10Manybubbles) [16:00:18] Jeff_Green: thanks! [16:00:21] (03CR) 10Manybubbles: [C: 032] Ruwiki gets Cirrus as primary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160975 (owner: 10Manybubbles) [16:00:21] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). 
[16:00:23] (03Merged) 10jenkins-bot: Don't double up shards for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160974 (owner: 10Manybubbles) [16:00:29] (03Merged) 10jenkins-bot: Ruwiki gets Cirrus as primary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160975 (owner: 10Manybubbles) [16:00:52] PROBLEM - Parsoid on wtp1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:00:57] someone forget a puppet-merge or two? [16:01:01] PROBLEM - Parsoid on wtp1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:01:02] PROBLEM - Parsoid on wtp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:01:17] let me restart the parsoid service in case it is something stuck. [16:01:33] i'll pull logs after. [16:02:27] on all the boxes? [16:02:57] <_joe_> subbu: I guess all boxes at 100% cpu doesn't look like "something stuck" [16:03:03] <_joe_> did you release something? [16:03:20] no, nothing deployed [16:03:21] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [16:03:31] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.016 second response time [16:03:51] <_joe_> subbu: do not restart on wtp1002 please [16:03:51] RECOVERY - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.028 second response time [16:03:58] <_joe_> oh you did [16:04:15] oops. i did a full restart. [16:04:21] RECOVERY - Parsoid on wtp1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.008 second response time [16:04:32] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.013 second response time [16:04:32] i stopped it .. 
[16:04:45] well CPU wise it looks better now [16:04:52] PROBLEM - LVS HTTP IPv4 on parsoid-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:04:57] <_joe_> akosiaris: :P [16:04:58] wtp1024 looks un-restarted [16:05:17] _joe_, they may not all have restarted [16:05:25] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.026 second response time [16:05:54] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [16:05:54] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [16:06:01] !log manybubbles Synchronized wmf-config/: set cirrus as primary search backend for ruwiki and make permanent some settings set on the fly (duration: 00m 06s) [16:06:07] Logged the message, Master [16:06:17] where does the nodejs code log at? [16:06:26] /var/log/parsoid/ [16:06:55] RECOVERY - LVS HTTP IPv4 on parsoid-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 1285 bytes in 0.008 second response time [16:06:59] /var/log/parsoid/parsoid.log [16:07:14] log being an overstatement. It is the console output in reality [16:07:18] hrm. did something change ops-wise at 11:51:12 AM? that's when the parsoid alerts started. [16:07:37] cscott: which timezone ? [16:07:39] <_joe_> cscott: 11:51Z? [16:07:44] EST, of course. [16:07:50] or EDT. whatever. [16:07:50] <_joe_> oh thanks [16:07:57] <_joe_> in UTC? [16:08:14] ;) i was just scrolling back to try to see if there was any channel chatter around the start of the parsoid ALERTs [16:08:32] those logs are awful :p [16:08:35] and the CPU/MEM rising started at 15:36 UTC [16:08:36] <_joe_> oh ok, I don't think we did anything [16:08:38] like, "hey, watch this!" ;) [16:08:51] they need timestamps and pids and such at least [16:09:02] <_joe_> and as akosiaris said, it was a progressive crippling effect [16:09:10] bblack: again, not really logs [16:09:15] console output [16:09:17] semantics? 
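Editorial aside on the timezone exchange above: the "11:51:12 AM" figure only lines up with the channel timestamps once it is converted to UTC. A minimal sketch of that conversion, assuming the local time was US Eastern Daylight Time (UTC-4) and taking the date 2014-09-17 from the debug line quoted elsewhere in this log; the names `EDT`, `to_utc` and `first_alert_local` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "11:51:12 AM" was EDT, i.e. a fixed UTC-4 offset on that date.
EDT = timezone(timedelta(hours=-4), "EDT")


def to_utc(local: datetime) -> datetime:
    """Convert a timezone-aware datetime to UTC."""
    return local.astimezone(timezone.utc)


# Assumption: the date is 2014-09-17, as in the labswiki debug line in this log.
first_alert_local = datetime(2014, 9, 17, 11, 51, 12, tzinfo=EDT)
first_alert_utc = to_utc(first_alert_local)

print(first_alert_utc.strftime("%H:%M:%S UTC"))
```

Under those assumptions the result is 15:51:12 UTC, which matches the timestamp of the first Parsoid PROBLEM line in the channel.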
[16:09:19] hm, i recently landed a bunch of changes to ocg (which mwalker originally wrote) to redirect all it's logging to winston. i could probably do the same for parsoid. [16:09:23] if they're not logs, where are the logs? :) [16:09:24] bblack, yes ...console output .. and we need to get on fixing those. [16:09:33] but I think the parsoid team has a plan to actually start logging [16:09:37] ok [16:09:51] <_joe_> oh, well, it's surely a nice-to-have [16:10:16] subbu: we do "really log" somewhere, don't we? it's what we've had our interns working on for a while. or haven't we turned everything on yet? [16:10:17] afaik there is a WIP patch even [16:10:36] (03PS1) 10Cmjohnson: Adding private entry for aluminium [dns] - 10https://gerrit.wikimedia.org/r/160981 [16:10:58] * _joe_ should use the tag sometimes [16:11:09] well most boxes are at full CPU still [16:11:13] at least according to ganglia [16:11:14] You should be sarcastic? :D [16:11:28] i stopped restart since _joe wanted me not to restart everything. [16:11:49] well 2-3 boxes should be more than enough [16:12:04] the rest I think we can restart [16:12:09] _joe_: anyway, it would be around 15:50 UTC, which is just *before* manybubbles said "synchronized wmf-config" according to https://wikitech.wikimedia.org/wiki/Server_Admin_Log. so he's innocent. ;) [16:12:14] were there any changes or restarts in the job queue? [16:12:16] gwicke, do yo know how to restart just a subset of boxes? [16:12:28] subbu: dsh [16:12:28] no reason to have a degraded service [16:12:35] dsh -g parsoid sudo service parsoid restart will restart everything [16:12:38] <_joe_> akosiaris: but look at the network output [16:12:38] cscott: I no break things! 
[16:12:41] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=wtp1003.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1410970339&g=mem_report&z=large&c=Parsoid%20eqiad [16:12:44] <_joe_> I think most jobs were in queue [16:12:47] memory ran away together with the CPU [16:12:48] <_joe_> so [16:12:55] <_joe_> cpu rising again but due to load [16:13:11] <_joe_> before that, something fishy was going on within parsoid [16:13:22] the actual problem started at :30 [16:13:23] we no have fish! [16:13:30] it just took a while for memory to become fully-consumed [16:13:31] * cscott tips his hat to manybubbles [16:13:39] <_joe_> like, something was blocking the non-blocking server [16:13:54] subbu: I will do it. I will not restart wtp1022, wtp1023, wtp1024 [16:13:58] akosiaris, thanks. [16:14:20] <_joe_> akosiaris: are you restarting everywhere else? [16:14:26] (03CR) 10Cmjohnson: [C: 032] Adding private entry for aluminium [dns] - 10https://gerrit.wikimedia.org/r/160981 (owner: 10Cmjohnson) [16:14:27] _joe_: yes [16:14:30] <_joe_> ok [16:14:34] I suppose those 3 are enough to debug this [16:14:39] <_joe_> yep [16:14:41] <_joe_> I agree [16:14:56] <_joe_> btw we're handling normal load with basically 3 servers right now [16:15:04] <_joe_> 4 [16:15:14] definitely a memory issue [16:15:15] I was about to say the same [16:15:31] all the processes are locked up on futex or epoll_wait it seems? [16:15:35] <_joe_> about the memory issue or what? 
[16:15:44] <_joe_> bblack: that's normal in node I guess [16:15:53] oh wait each process is threaded, too [16:15:55] yes, that's normal [16:15:56] <_joe_> if something upstream is non responding [16:16:14] RECOVERY - Parsoid on wtp1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.022 second response time [16:16:16] basically once heap gets close to 1.8g GC uses almost all cpu time [16:16:29] the root cause is having such a huge heap [16:16:35] <_joe_> JAVAscript [16:16:36] normal is 200-300m per process [16:16:45] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.004 second response time [16:16:46] gwicke, looking through logs to see what page might have cuased this. [16:16:47] so something in the Monday deploy is likely wrong [16:16:55] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.004 second response time [16:17:05] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.013 second response time [16:17:05] _joe_: hehe, 1.8G is *small* ;) [16:17:07] well the 4-5 secs that it said right before the alerts is bad then [16:17:14] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.007 second response time [16:17:14] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.005 second response time [16:17:23] by JAVA standards [16:17:24] <_joe_> gwicke: my comment was about GC sending the VM the way of the dodo [16:17:39] <_joe_> and no, it's not that small even by java standards [16:17:42] <_joe_> :) [16:17:45] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.019 second response time [16:17:47] ah, that's true of any GC doing full collections over and over [16:17:54] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.011 second response time [16:17:56] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.007 second response time [16:17:56] 
RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.005 second response time [16:17:56] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.021 second response time [16:18:05] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.025 second response time [16:18:15] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.010 second response time [16:18:18] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.212 second response time [16:18:18] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.187 second response time [16:18:21] manybubbles might sneeze at 1.8G heaps ;) [16:18:27] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.012 second response time [16:18:29] gwicke: weak [16:18:36] 30GB is pretty good [16:18:42] there you go ;) [16:19:04] a bit above that and the VM can't do pointer compression tricks so you end up taking up much more space [16:20:01] in node the ~2G limit is also caused by using some bits of a 32-bit pointer internally for tagging [16:20:45] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.025 second response time [16:21:21] must be just 1 bit [16:21:54] these are 64-bit machines anyways, so why are 32-bit pointers being used? :p [16:21:58] for size [16:22:09] and it's one more bit than the allocation granularity [16:22:44] 2G is kinda small [16:22:50] they don't have a flag to go 64-bit? [16:22:52] without timestamps, hard to figure out the title(s) that triggered this .. have to match up start/end parsing lines ... [16:23:07] !log caused cirrus brownout by executing a force merge for enwiki's general index. 
ooops [16:23:12] Logged the message, Master [16:23:13] i'm stepping in a debugger on wtp1024 [16:23:14] not that it would help us anyways [16:23:37] but it seems a strange architectural choice to limit themselves to 2G of memory [16:23:45] !log restarted node on wtp boxes except wtp1022,wtp1023,wtp1024 [16:23:50] Logged the message, Master [16:24:02] bblack: yeah - jvm does the same thing - because objects often contain pointers to other objects it is a pretty steep space savings to use 32 bit pointers [16:24:22] heh [16:24:53] "The solution to the over-use of objects and pointers is to reduce the pointer size to save memory, and in the process massively limit the total memory available in the first place" [16:25:01] yeah, 64-bit in not a panacea. [16:25:18] dammit… Reedy, we broke something in wikitech; the dynamic sidebar doesn't load anymore [16:25:29] andrewbogott: Wasn't it broken before? [16:25:38] I thought you tested it and noticed it was already broken on "old" wikitech [16:25:39] Um… maybe? Not this broken. [16:25:47] and most apps don't need 2G of contiguous memory. linux can allocate lots of "32-bit pointer" processes in your 64-bit memory space, which is often more useful than a single process hogging it all. [16:25:53] It didn't have disclosures, but it still displayed the content properly. [16:25:57] Now there's no content at all. [16:26:02] Just the boilerplate mw sidebar [16:26:05] cscott: I get all that, but the VM should still have the option to go 64 [16:26:23] bblack: it's optimized for small heaps [16:26:25] --make-my-app-slower-but-give-me-big-pointers-dammit [16:26:47] making big heaps GC with small pause times is hard [16:26:48] sometimes that's what you need [16:26:56] there are actually substantially different VM choices you make. it's not really a flag, it's a whole different VM. [16:27:07] the jvm has client, server, and some other different variants to handle this. [16:27:08] bblack: the jvm certainly has that option. 
but as gwicke said small pause times become a problem. [16:27:08] arguably GC and over-use of OO and pointers is the problem, not the pointer width, but that's a whole other ball of wax [16:27:30] the log on wtp1022 is static .. so, i am looking at titles at the end and parsing locally to see if any of them trigger this behavior. [16:28:16] x32 ABI is an interesting thing here too [16:28:24] but I guess distros don't widely support it yet [16:29:28] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce mathoid LVS IP [dns] - 10https://gerrit.wikimedia.org/r/159996 (owner: 10Alexandros Kosiaris) [16:29:49] bblack: in node you normally use multiple processes and message passing for cpu parallelism and robustness [16:30:10] Reedy: it's pretty serious, basically everything to do with labs is controlled via the sidebar [16:30:37] gwicke: that still has nothing to do with total memory requirements, really. 2G is an arbitrary limit, and the things you're talking about are scale-invariant [16:30:44] gwicke, cscott i have a candidate title that locks up parsoid [16:30:47] some data sets are large :p [16:30:54] huwiki/Vegyületek_összegképlet-táblázata [16:31:12] ^ that was the last title in the log on the first node I looked at [16:31:19] !log just going to make this clear - the current cirrus brownout doesn't seem to be effecting my queries but we're getting hit with pool counter full events - sadness. its not caused by switching cirrus to ruwiki's primary backend - its caused by me attempting to perform index maintenance activities. [16:31:23] that certainly locks up my brain's pronunciation algorithm [16:31:25] bblack: yup, best to use C++ for those ;) [16:31:26] Logged the message, Master [16:31:43] switch languages because your data got bigger? :p [16:31:44] cscott: huwiki/HungarianWords [16:31:52] bblack, ah, ok. 
:) [16:32:05] bblack: even the Java folks are always trying to get stuff off the heap [16:32:38] logging out of wtp1022 onw [16:33:03] I don't like java much, either :) [16:33:06] cscott, gwicke i also pulled all logs onto bast1001 .. it is in ~ssastry/logs btw [16:33:07] there are actually some very fun VM things you can do with 64-point pointers. but, like i said, you basically have to write your VM ground up around the pointer size. and performance usually suffers in 64-bit. [16:33:30] subbu: hm, a very large sortable table. [16:33:50] subbu: quite possibly a O(N^2) lookahead issue in the tokenizer? [16:33:53] it's not inconceivable that a VM can have one codebase and support multiple pointer widths via re-(jit)-compilation [16:33:58] maybe --> #parsoid? [16:34:19] bblack: the JVM does that [16:34:21] subbu: yup. [16:34:21] afaik [16:34:49] "performance really suffers in 64-bit" isn't really about 64-bit, it's about languages designed around patterns of over-use of OO and pointers and GC [16:36:03] but still, the whole point of higher-level language abstractions is to get away from thinking about these details. the programmer shouldn't have to even think about whether his data fits in 2GB or not. Maybe the deployer with a runtime flag for the VM. [16:36:26] bblack: there's always C or C++ [16:37:23] fundamentally GC just doesn't scale to extreme heap sizes [16:37:26] gwicke: physikerwelt, btw sca1001, sca1002 are ready for their first mathoid deployment [16:37:27] andrewbogott: Did it break with me fixing the interwiki links? [16:37:37] gwicke, i think we can get the rest of the nodes restarted as well? [16:37:43] subbu: yes [16:37:45] Reedy: I think so. [16:37:50] subbu: doing so.. [16:37:54] k. thanks. [16:38:06] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.513333333333 [16:38:17] gwicke, can you also add notes to the deployment page how to restart a subset of nodes (vs. restarting everything)? 
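Editorial aside on restarting only a subset of the Parsoid hosts: one way to picture it is to build the per-host commands and leave out the boxes kept for debugging. This is a rough sketch, not the documented deployment procedure. It assumes the cluster is exactly wtp1001 through wtp1024 (the hosts named in the alerts), that plain `ssh <host>` reaches them, and that `sudo service parsoid restart` is the restart command quoted in the channel. The helper `restart_commands` is hypothetical, and the code only prints commands without running anything.

```python
# Assumption: the full host list is wtp1001..wtp1024, as seen in the alerts.
ALL_HOSTS = [f"wtp10{n:02d}" for n in range(1, 25)]

# Hosts left untouched so the stuck processes can still be inspected.
KEEP_FOR_DEBUG = {"wtp1022", "wtp1023", "wtp1024"}


def restart_commands(hosts, skip):
    """Return one ssh restart command per host that is not in `skip`."""
    return [
        f"ssh {host} sudo service parsoid restart"
        for host in hosts
        if host not in skip
    ]


commands = restart_commands(ALL_HOSTS, KEEP_FOR_DEBUG)
for command in commands:
    print(command)
```

Printing the commands first gives a chance to review the host list before anything is restarted.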
[16:38:32] subbu: it's just the same command [16:38:35] Reedy: I'm reading through the config, not learning much so far [16:38:38] except run on individual nodes [16:39:00] sudo service parsoid restart [16:39:02] icinga-wm: yeah - I bet [16:39:45] k [16:39:55] RECOVERY - Parsoid on wtp1024 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.016 second response time [16:40:22] ssh to boxes one at a time and restart ... i thought there was some other way to do it via dsh. [16:40:27] (03PS1) 10Alexandros Kosiaris: nfs cleanups [puppet] - 10https://gerrit.wikimedia.org/r/160984 [16:40:29] Reedy: I turned on debugging, and I see some log lines from dynamicsidebar. So it is getting loaded. My guess is it can't find its content... [16:41:07] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.007 second response time [16:41:31] Content should come from here: https://wikitech.wikimedia.org/wiki/MediaWiki:Sidebar [16:42:05] $title = Title::makeTitle( NS_MEDIAWIKI, 'Sidebar/Group:' . $group ); [16:42:23] !log restarted parsoid on wtp102{2,3,4} [16:42:30] Logged the message, Master [16:43:05] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.029 second response time [16:43:10] Reedy: that's from the dynamicsidebar source? [16:43:14] yeah [16:43:27] And we just changed the value of NS_MEDIAWIKI? [16:43:35] https://wikitech.wikimedia.org/w/index.php?title=Special%3APrefixIndex&prefix=Sidebar%2FGroup&namespace=8 [16:43:51] https://wikitech.wikimedia.org/wiki/MediaWiki:Sidebar/Group:cloudadmin [16:44:24] * andrewbogott really wishes he could open links [16:45:31] Reedy: so, yeah, that's where the content should come from... [16:45:44] but I'm thick and not yet putting the pieces together [16:52:54] andrewbogott: Hmm. Did I switch wikitech to 1.24wmf21 too? [16:53:23] No idea why that'd break the extension... 
[16:53:26] hm, looks like [16:53:32] James_F: you has mail [16:55:37] Reedy: I believe that the sidebar behavior now is the same as the old behavior when I wasn't logged in. Dunno if that's relevant. [16:56:38] What does the debug output say? [16:58:06] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.646666666667 [16:58:08] Just 2014-09-17 16:54:16 virt1000 labswiki: Entering modifySidebar [16:58:13] each time I load a page [16:59:28] It should give self::printDebug( "Using group sidebar" ); too [16:59:35] (03PS1) 10Manybubbles: Get back fewer stats to lower overhead [puppet] - 10https://gerrit.wikimedia.org/r/160987 [16:59:50] speaking of sidebar, can we add the releng SAL? [16:59:53] (03CR) 10Manybubbles: "I think we should try this!" [puppet] - 10https://gerrit.wikimedia.org/r/160987 (owner: 10Manybubbles) [17:00:14] Reedy: yeah, it should. I'm going to twiddle some settings [17:00:25] (03CR) 10jenkins-bot: [V: 04-1] Get back fewer stats to lower overhead [puppet] - 10https://gerrit.wikimedia.org/r/160987 (owner: 10Manybubbles) [17:01:27] (03PS2) 10Manybubbles: Get back fewer stats to lower overhead [puppet] - 10https://gerrit.wikimedia.org/r/160987 [17:02:15] Just spotted a missing message [17:02:16] (03CR) 10Manybubbles: "Code was too wide to read on 40 year old monitors. Had to fix it." [puppet] - 10https://gerrit.wikimedia.org/r/160987 (owner: 10Manybubbles) [17:02:41] https://bugzilla.wikimedia.org/show_bug.cgi?id=67852 [17:02:44] already reported [17:03:19] manybubbles: looks good, I can merge right away [17:03:25] godog: sure! 
[17:03:34] I think it'll lower user cpu usage [17:03:36] which is good [17:03:37] (03CR) 10Filippo Giunchedi: [C: 031] Get back fewer stats to lower overhead [puppet] - 10https://gerrit.wikimedia.org/r/160987 (owner: 10Manybubbles) [17:03:47] James_F: i think Coren can add himself :-) [17:03:57] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Get back fewer stats to lower overhead [puppet] - 10https://gerrit.wikimedia.org/r/160987 (owner: 10Manybubbles) [17:04:06] manybubbles: yep can't hurt [17:04:30] !log cirrus brownout looks just about fixed. So! My plan for periodically explicitly merging deletes has some problems..... [17:04:30] {{done}} [17:04:35] Logged the message, Master [17:07:23] Reedy: it looks to me like the arg &$sidebar passed in to the extension is empty. I don't quite understand how hooks and args work, though, so not sure where to look next. [17:07:44] Uhhh...etherpad down? [17:07:52] Nope, just a hiccup [17:09:21] godog: thanks! [17:09:42] whenever I looked at the hot threads I saw a ton of those shards stats things [17:10:40] andrewbogott: It comes from Skin.php [17:12:52] andrewbogott: Might be worth trying to revert to 1.24wmf20 and see if that fixes it [17:13:23] hm, ok [17:14:12] Of course that just defers the problem for a few days [17:15:00] yeah [17:15:10] skin code might've been broken between the 2 [17:18:00] James_F: also, if greg-g is listed as CC and you CC him in your mail client too then he gets 2 copies :-) [17:19:19] afk for fooood [17:20:19] jeremyb: James_F it's true, I did :) [17:20:47] (03PS1) 10Andrew Bogott: Move labswiki back to wmf20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160989 [17:22:42] springle: what's with all the virt0 db errors? 
[17:22:56] AaronSchulz: dont' worry about it, I'm killing off virt0 shortly [17:23:12] (03CR) 10Andrew Bogott: [C: 032] Move labswiki back to wmf20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160989 (owner: 10Andrew Bogott) [17:23:20] (03Merged) 10jenkins-bot: Move labswiki back to wmf20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160989 (owner: 10Andrew Bogott) [17:23:54] !log andrew Synchronized wikiversions.json: (no message) (duration: 00m 05s) [17:24:00] Logged the message, Master [17:25:15] hm… bd808, after I change the version of a wiki do I need to run something other than sync-file? [17:26:01] andrewbogott: sync-wikiversions [17:26:12] ok... [17:26:13] !log andrew rebuilt wikiversions.cdb and synchronized wikiversions files: (no message) [17:26:19] Logged the message, Master [17:26:22] I'm getting back into the 'break all of wikipedia' realm today :( [17:26:43] * andrewbogott would be happier without a tin login [17:27:02] * bd808 wants to make deployment an operational concern [17:27:44] nooooooo [17:27:55] Reedy: Reverting to 20 didn't make a difference. [17:28:07] bd808: want to take a shift helping me with this dumb problem? [17:28:15] (which was caused by fixing a different dumb problem) [17:28:31] which one? The sidebar thing? [17:28:34] yeah [17:29:16] So what happened? The custom sidebar code isn't being loaded now? [17:29:58] The symptom is… after the sidebar isn't customized for wikitech anymore. [17:30:08] the dynamicsidebar extension is loading [17:30:11] but it doesn't see any content. [17:30:16] heh. https://wikitech.wikimedia.org/wiki/MediaWiki:Sidebar is there [17:30:31] * bd808 looks at gerrit for "reedy magic" [17:31:15] Yep, the content is there. I suspect that the sidebar callback is passing in a mangled path due to [17:31:32] you think setting $wgMetaNamespace did it? [17:32:12] hmm... 
have you tried making a page at https://wikitech.wikimedia.org/wiki/Wikitech:Sidebar [17:32:22] Yes, or https://gerrit.wikimedia.org/r/#/c/160966/ [17:33:04] (03PS1) 10RobH: setting install params for ms-be2001-2011 [puppet] - 10https://gerrit.wikimedia.org/r/160992 [17:33:34] bd808: https://wikitech.wikimedia.org/wiki/Wikitech:Sidebar seems not to have changed anything [17:34:06] andrewbogott: Ok. random guess anyway [17:34:13] good idea [17:34:13] I'll look a bit [17:34:17] thank you [17:34:32] I'm sorry this project has been such a tarpit for you guys [17:34:40] (03CR) 10RobH: [C: 032] setting install params for ms-be2001-2011 [puppet] - 10https://gerrit.wikimedia.org/r/160992 (owner: 10RobH) [17:34:57] :) I knew going in I was fibbing about how easy it would be [17:35:41] (03PS1) 10Yuvipanda: labmon: Kill CPU usage monitors for toollabs [puppet] - 10https://gerrit.wikimedia.org/r/160993 [17:35:44] Coren: ^ [17:36:13] YuviPanda: Not just CPU usage. That repeated warning for space on /var on labmon1001 is also broken. [17:36:39] let me actually investigate that, since everytime I see it it's actually really low [17:36:55] I'll disable it if I can't figure it out in an hour or so [17:37:25] YuviPanda: How is it "really low"? I've never seen it go below 40G free. [17:37:31] Coren: 40G? [17:37:34] /var? [17:37:49] labmon1001 doesn't have a separate /var [17:38:03] Coren: oh, they're for toollabs hosts, not for labmon1001 [17:38:06] they just run on labmon1001 [17:38:10] flood of labswiki memcached errors too [17:38:27] Coren: and we can't configure the hostname properly with our current setup. [17:38:34] AaronSchulz: on virt0? [17:38:45] Coren: it's tools-webproxy that mostly triggers the _var errors [17:38:48] YuviPanda: Then that's reason enough to turn it off. It's useless, and spammy. [17:39:06] Coren: it's not useless. There were *several* times when hosts died because /var filled up on tools [17:39:18] jeremyb, greg-g: Whoops. Sorry! 
[17:39:31] Sure, but what's the use of an alarm that tells you "something is broken, somewhere, but I'm not telling"? [17:39:43] Coren: it does tell you [17:39:46] * bd808 eyes wgSidebarCacheExpiry suspiciously [17:39:51] PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-shadow.cpu.total.user.value (77.78%) WARN: tools.tools-submit.cpu.total.user.value (55.56%) [17:39:54] "PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (30.00%)" [17:40:05] Coren: right, that's because tools-webproxy's hostname is tools [17:40:08] for some random reason [17:40:58] Hmmm. Yeah, okay. That /vaguely/ makes sense. [17:41:30] (03CR) 10coren: [C: 032] "Yes, please stfu." [puppet] - 10https://gerrit.wikimedia.org/r/160993 (owner: 10Yuvipanda) [17:42:16] (03CR) 10Dzahn: "duplicate of Change-Id: Ibbe0a4f209422a4a ?" [puppet] - 10https://gerrit.wikimedia.org/r/159737 (owner: 10RobH) [17:42:21] Coren: this will be nicer once shinken works well, but that's going to be a while [17:42:28] andrewbogott: yep, see the memcached-serious log on fluorine [17:42:57] bd808: is it possible we just have an empty cache entry? I tried $wgEnableSidebarCache = false; and that didn't seem to matter [17:43:39] AaronSchulz: what's the /full/path.log for that? [17:44:01] andrewbogott: Apparently it doesn't matter for authed users with that plugin installed -- https://github.com/wikimedia/mediawiki-extensions-DynamicSidebar/blob/master/DynamicSidebar.body.php#L12-L14 [17:44:35] ah, great. So, probably not a bad cache [17:46:42] But I don't think it's DynamicSidebar's problem. 
If it was broken there would just be junk in the MediaWiki:Sidebar content [17:46:57] And this looks like MediaWiki:Sidebar is being ignored completely [17:47:01] andrewbogott: /a/mw-log/* [17:47:12] oh, of course, /a [17:47:19] ;) [17:48:41] hm, I restarted memcached on virt1000 and that made things much noisier for a moment… but solved nothing [17:49:30] well, at least it looks like those complaints predate our immediate problem [17:50:11] shit, bd808, restarting memcached fixed everything [17:50:24] Reedy: ^ [17:50:43] hah. bad cache is bad [17:51:11] That's not ordinarilly a part of deploying is it? restarting the cache? [17:51:19] 'cause on wikitech it breaks all kinds of session things [17:51:36] No, but switching from one db group to another isn't "normal" either [17:52:02] fluroine is still full of these: labswiki: Memcached error: Error connecting to 127.0.0.1:11211: Connection refused [17:52:09] Unrelated, predating today's changes [17:52:22] I presume there's some disagreement about port # [17:52:40] …but perhaps I will choose not to care about this today [17:53:19] that is something not using the proxy port. Not sure what [17:53:56] 11211 is the default memcached port. we run twmproxy on 11212 [17:54:18] So maybe some other service is running on that box and didn't get the message [17:55:09] Something besides $wgObjectCaches['memcached-pecl'] hitting memcached from the wiki? [17:55:48] Is the "old" wiki still up somehow? [17:56:45] I'm pretty sure I see one of those errors everytime I do a wikitech page load [18:00:35] I don't see that we set wgMemCachedServers in mediawiki-config. Something could conceivably be creating a MemcachedBagOStuff instead of grabbing a cache from wgObjectCaches [18:01:13] The default for wgMemCachedServers is 127.0.0.1:11211 [18:02:30] what should I grep for to look for examples of MemcachedBagOStuff? 
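Editorial aside on the port mismatch discussed above: to see which endpoint the stray client is actually trying to reach, the error lines can be tallied by address and port. A minimal sketch, assuming the log messages have the shape quoted in the channel ("Memcached error: Error connecting to HOST:PORT: REASON"); the sample lines are rebuilt from messages quoted in this log, and `tally_endpoints` is a hypothetical helper, not part of MediaWiki.

```python
import re
from collections import Counter

# Assumption: the error text follows the format quoted in the channel.
PATTERN = re.compile(
    r"Memcached error: Error connecting to "
    r"(\d{1,3}(?:\.\d{1,3}){3}):(\d+): (.+)$"
)


def tally_endpoints(lines):
    """Count connection errors per (host, port) pair."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Illustrative sample only, built from lines quoted in this log.
sample = [
    "labswiki: Memcached error: Error connecting to 127.0.0.1:11211: Connection refused",
    "labswiki: Memcached error: Error connecting to 127.0.0.1:11211: Connection refused",
    "labswiki: Entering modifySidebar",
]

counts = tally_endpoints(sample)
for (host, port), hits in counts.most_common():
    print(f"{host}:{port} -> {hits} errors")
```

If every hit lands on 127.0.0.1:11211, that is consistent with a client falling back to the stock memcached port instead of the proxy on 11212.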
[18:02:34] presumably not that :) [18:02:49] Reedy: Tons of log messages from MassMessage saying "Disabling contenthandler features because $wgContentHandlerUseDB = false" [18:02:55] Seems not good [18:03:01] It's purposeful [18:03:10] Though, it probably doesn't need to log every time [18:03:13] legoktm: ^^ [18:03:14] oh. 826k in last hour :) [18:03:29] lol oops. [18:03:33] lets just remove that... [18:04:24] legoktm: centralauth patch going in today ? [18:04:33] Reedy: https://gerrit.wikimedia.org/r/160997 [18:04:46] matanya: er, which patch? [18:04:54] hoo's [18:05:44] Would be nice to get that in before code freeze tomorrow [18:05:52] oh https://gerrit.wikimedia.org/r/#/c/160526/ ? [18:06:03] * legoktm reviews [18:06:03] andrewbogott: That's actually the class name :) It seems like we should override the default in the generic mediawiki-config settings anyway. [18:06:13] legoktm: :) Stewards will love you [18:07:19] we already do [18:07:27] More :D [18:09:48] I'm reading through it, and do we still need to allow people to delete global accounts? [18:10:22] legoktm: Not after user merge, I guess [18:10:45] same should go for unattach [18:11:03] but I chose to let it be there for now, to make review faster :P [18:12:42] (03PS1) 10BryanDavis: Set wgMemCachedServers to point to nutcracker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161005 [18:13:09] Reedy: ^ seem reasonable? [18:13:36] (03PS2) 10Hoo man: Set wgMemCachedServers to point to nutcracker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161005 (owner: 10BryanDavis) [18:13:41] typo tpyo typo [18:13:54] on second thought, we need the delete interface so I can test how unattached accounts behave :P [18:14:07] I'd be intrigued to find out what's accessing it badly [18:14:09] :D [18:14:23] legoktm: I keep messing with my DB to test stuff like that [18:14:26] bad habits... [18:14:33] it's a total mess these days :D [18:14:47] if you were using vagrant... 
[18:14:59] (NB, I'm not either) [18:15:00] Reedy: I would as well, should we add some logging that is a little more informative? [18:15:17] legoktm: Thanks for the merge... awesome :) [18:15:27] hoo: I think we can close https://bugzilla.wikimedia.org/show_bug.cgi?id=23391 now? [18:15:29] * hoo will write an email to stewards-l in time [18:15:57] legoktm: Nice... yes [18:16:08] comment #0 is fulfilled [18:16:17] (03PS10) 10Yuvipanda: [WIP] Initial shinken setup for labs [puppet] - 10https://gerrit.wikimedia.org/r/160626 [18:17:53] (03Abandoned) 10RobH: adding star.wmfusercontent.org.pem and GlobalSign.pem [puppet] - 10https://gerrit.wikimedia.org/r/159737 (owner: 10RobH) [18:18:13] If only we had structured logging for wfDebugLog... ;) [18:18:15] Reedy: ori_ hey, looks like maybe the docroot change broke beta now? http://en.wikipedia.beta.wmflabs.org/ [18:18:34] That looks pretty borked [18:18:51] Reedy: also, /join #wikimedia-qa, plz ;) [18:18:51] hhvm not parsing as php [18:18:55] hoo: if you need any review, poke, i speed it a bit :D [18:19:23] :) Despite of that it's only legoktm needing reviews from me :P [18:20:31] hoo|away: https://gerrit.wikimedia.org/r/#/c/160907/ !!! [18:20:34] (03CR) 10BryanDavis: "It may be a better idea to add a stracktrace to MemcachedClient::_error_log so we can see the caller in such instances." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161005 (owner: 10BryanDavis) [18:23:00] !log phabricator - made aklapper an admin [18:23:07] Logged the message, Master [18:25:51] Who knows about redis? YuviPanda maybe? [18:25:57] hmm? [18:25:58] Right now virt1000 is a 'slaveof' virt0 [18:25:59] (03PS1) 10Yurik: Added zero portal impersonator user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161010 [18:26:12] And I want to just turn that off. Remove the 'slaveof' line entirely, and then make another box a 'slaveof' virt1000 [18:26:34] andrewbogott: slaveof no one [18:26:36] Think that'll just work? 
Or will virt1000 forget everything it knows as soon as I sever its contact with virt0? [18:27:04] andrewbogott: nope, won't lose data [18:27:10] > The form SLAVEOF NO ONE will stop replication, turning the server into a MASTER, but will not discard the replication. So, if the old master stops working, it is possible to turn the slave into a master and set the application to use this new master in read/write. Later when the other Redis server is fixed, it can be reconfigured to work as a slave. [18:27:24] Hey, that's just what I want it to do! [18:27:25] woo redis [18:28:16] andrewbogott: :D [18:28:19] thanks [18:28:31] andrewbogott: yw :) [18:29:16] (03CR) 10Yurik: [C: 032] Added zero portal impersonator user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161010 (owner: 10Yurik) [18:29:20] (03Merged) 10jenkins-bot: Added zero portal impersonator user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161010 (owner: 10Yurik) [18:30:57] (03CR) 10Yurik: [C: 032] Updated login-logout whitelisted pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159926 (owner: 10Yurik) [18:31:06] (03Merged) 10jenkins-bot: Updated login-logout whitelisted pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159926 (owner: 10Yurik) [18:31:25] PROBLEM - CI tmpfs disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 29 MB (5% inode=99%): [18:32:17] (03PS1) 10Andrew Bogott: Virt1000 is no longer a redis slave of virt0. [puppet] - 10https://gerrit.wikimedia.org/r/161012 [18:32:19] (03PS1) 10Andrew Bogott: Turn on dns and ldap on labcontrol2001. [puppet] - 10https://gerrit.wikimedia.org/r/161013 [18:35:31] Coren, have time to talk through https://gerrit.wikimedia.org/r/#/c/161013/ with me? [18:37:53] (03CR) 10Andrew Bogott: "Ldap:" [puppet] - 10https://gerrit.wikimedia.org/r/161013 (owner: 10Andrew Bogott) [18:38:30] (03CR) 10Andrew Bogott: "As for DNS... 
I want to remove virt0 and add labcontrol2001 as a secondary for virt1000 and add virt1000 as a secondary for labcontrol2001" [puppet] - 10https://gerrit.wikimedia.org/r/161013 (owner: 10Andrew Bogott) [18:40:11] !log yurik Synchronized wmf-config/: private wikis login/logout page names, zeroportal impersonator acct (duration: 01m 06s) [18:40:18] Logged the message, Master [18:43:41] !log yurik Synchronized php-1.24wmf21/extensions/: update to JsonConfig, ZeroBanner, ZeroPortal (duration: 01m 35s) [18:43:46] Logged the message, Master [18:44:26] (03PS2) 10Andrew Bogott: Turn on dns and ldap on labcontrol2001. [puppet] - 10https://gerrit.wikimedia.org/r/161013 [18:44:31] (03PS1) 10Yuvipanda: labmon: Add basic graphite based monitoring for contint [puppet] - 10https://gerrit.wikimedia.org/r/161015 [18:44:44] Krinkle: ^ [18:45:04] (03CR) 10jenkins-bot: [V: 04-1] Turn on dns and ldap on labcontrol2001. [puppet] - 10https://gerrit.wikimedia.org/r/161013 (owner: 10Andrew Bogott) [18:45:12] Krinkle: right now it's only monitoring based on graphite values, so look at graphite and tell me if you want monitors for anything else [18:45:13] (03CR) 10Andrew Bogott: "ok, dns replication should be fixed in ps2. Maybe ldap replication too, unclear" [puppet] - 10https://gerrit.wikimedia.org/r/161013 (owner: 10Andrew Bogott) [18:45:32] YuviPanda: k, will take a look now [18:45:37] Krinkle: cool [18:46:01] (03PS3) 10Andrew Bogott: Turn on dns and ldap on labcontrol2001. [puppet] - 10https://gerrit.wikimedia.org/r/161013 [18:47:13] !log yurik Synchronized php-1.24wmf20/extensions/: update to JsonConfig, ZeroBanner, ZeroPortal (duration: 01m 39s) [18:47:19] Logged the message, Master [18:48:01] (03PS4) 10Andrew Bogott: Turn on dns and ldap on labcontrol2001. 
[puppet] - 10https://gerrit.wikimedia.org/r/161013 [18:49:05] (03CR) 10Dzahn: [C: 032] labmon: Add basic graphite based monitoring for contint [puppet] - 10https://gerrit.wikimedia.org/r/161015 (owner: 10Yuvipanda) [18:50:23] mutante: eh, I was going to give feedback first. Yuvi just asked [18:50:35] YuviPanda: mutante: Is there an acknowledgement thing? e.g. puppet currently has 1 failure. [18:50:48] what is the frequency otherwise? [18:50:51] Krinkle: I can add more patches :) [18:50:56] Krinkle: you can acknowledge at https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=labmon [18:51:00] Krinkle: yes, there is, but it will need another patch to give you a login that has permissions [18:51:05] Krinkle: it emails you only for a state change [18:51:13] mutante: oh, I thought everyone with wmf group had perms? [18:51:31] YuviPanda: logging in and executing commands are separate things [18:51:36] mutante: oh, ok [18:51:44] need to add him to the icinga config [18:52:00] and that user must match labs LDAP user [18:52:09] Krinkle: either way, you'll get emails only for state change (ok to critical, etc) [18:52:30] Yeah, I can log in but it says Not Authorised if I e.g. try to add a comment to beta labs puppet freshness [18:53:00] that's because of [18:53:06] files/icinga/cgi.cfg [18:53:24] YuviPanda: /puppet$ grep authorized files/icinga/cgi.cfg [18:53:26] YuviPanda: For integration, I'd like /mnt checks as well for integration-slave* nodes [18:53:32] since that's where Jenkins runs [18:53:38] Krinkle: cool [18:53:56] I'm previewing the search sgtrings from that patch on graphite.wmflabs.org [18:53:57] sweet [18:54:19] mutante: service commands and host commands only, I suppose? [18:54:24] YuviPanda: I guess it does a max() to find the worst? Because it doesn't have entries for separate nodes. 
[18:54:28] Krinkle: fyi http://docs.icinga.org/latest/en/objectdefinitions.html#contact [18:54:31] PROBLEM - Disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 18 MB (3% inode=99%): [18:55:03] host_notification_options d,r,f [18:55:03] service_notification_options c,r,f [18:55:04] Krinkle: nope, I wrote python code to treat them individually as separate series. check_graphite in the puppet repo [18:55:25] YuviPanda: Ah, ok. So the * is expanded before it goes to graphite [18:55:25] cool [18:55:28] Krinkle: you can see the raw JSON by copying the URL for a graph, and adding format=json [18:55:32] Krinkle: nope, it's expanded by graphite :) [18:55:38] the JSON has segmented info per host [18:55:47] * Krinkle looks at lanthanum crit [18:57:37] (03PS1) 10Yuvipanda: labmon: Add contint monitoring for /mnt space as well [puppet] - 10https://gerrit.wikimedia.org/r/161019 [18:57:44] RECOVERY - Disk space on lanthanum is OK: DISK OK [18:57:44] RECOVERY - CI tmpfs disk space on lanthanum is OK: DISK OK [18:57:53] Krinkle: ^ monitoring for /mnt [18:59:43] !log disabled icinga alerts for ms-be1001, rebooting it to look at its raid bios settings for codfw deployment mirroring [18:59:49] Logged the message, Master [19:00:14] !log jenkins-slave tmpfs on lanthanum was filling up (> 500MB). I purged tmp dbs for old jobs. We should get these purged automatically and also increase the size as 500MB is too little. [19:00:21] Logged the message, Master [19:00:30] Coren: nm, I think I've sorted out my main questions. Still welcome your review of https://gerrit.wikimedia.org/r/#/c/161013/ though. [19:00:32] * andrewbogott relocating [19:03:22] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:05:29] ok, that is interesting. [19:05:57] cmjohnson1 / godog : ms-be1001 is indeed configured as a bunch of raid0 single disk arrays. [19:06:08] (i moved conversation to here cuz well, ops) [19:06:11] YuviPanda: Cool. 
/mnt is 80GB and jobs workspaces can be quite large (a single mediawiki-core is > 250 MB). Let's put initial warning at 16GB and crit at 1GB? [19:06:23] godog: So if you have that command for making them sounds good [19:06:26] Krinkle: sure [19:06:31] if not i have a lot of setup to do, heh [19:06:52] robh: does it state boot from ssds. [19:07:05] where? [19:07:07] curious..as that was always a problem in the past [19:07:11] cmjohnson1: oh, asking if it does [19:07:15] i dont see anyplace to set that on the 710s [19:07:16] yes [19:07:18] unlike 310s [19:07:46] (03PS2) 10Yuvipanda: labmon: Add contint monitoring for /mnt space as well [puppet] - 10https://gerrit.wikimedia.org/r/161019 [19:07:47] Krinkle: done [19:07:50] I use the old raid bios ...so I know where to go there [19:08:00] ctrl -r [19:08:14] (03CR) 10Krinkle: [C: 031] "Yay!" [puppet] - 10https://gerrit.wikimedia.org/r/161019 (owner: 10Yuvipanda) [19:08:22] yes well, i do it in this and i dont see an option, but will reboot into the other one to check [19:08:31] and i see option in this method on 310s [19:08:32] mutante: can you merge ^? [19:08:40] (03CR) 10Hashar: [C: 031] "Awesome!" [puppet] - 10https://gerrit.wikimedia.org/r/161019 (owner: 10Yuvipanda) [19:08:45] YuviPanda: Ah, I see 'series' does it [19:09:37] robh: so yeah 12 disks raid0 + 2 ssd raid1 correct? 
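[Editor's note] The graphite discussion above boils down to: the wildcard is expanded by graphite, the JSON output has one entry per host, and each series is judged on its own rather than via a max(). A minimal sketch of that idea follows. It is not the actual check_graphite code from the puppet repo; the target names and sample values are hypothetical, and the 16GB/1GB thresholds are the ones Krinkle proposed for /mnt. The JSON shape (a list of objects with `target` and `datapoints` as `[value, timestamp]` pairs) is what graphite's render endpoint returns with format=json.

```python
import json

GIB = 1024 ** 3


def last_value(datapoints):
    """Return the most recent non-null value from [value, timestamp] pairs."""
    for value, _timestamp in reversed(datapoints):
        if value is not None:
            return value
    return None


def check_series(render_json, warn, crit):
    """Classify every series separately; lower values are worse (free space)."""
    results = {}
    for series in json.loads(render_json):
        value = last_value(series["datapoints"])
        if value is None:
            state = "UNKNOWN"
        elif value <= crit:
            state = "CRITICAL"
        elif value <= warn:
            state = "WARNING"
        else:
            state = "OK"
        results[series["target"]] = state
    return results


# Hypothetical render output for a wildcard query over three hosts.
sample = json.dumps([
    {"target": "integration.slave-a.diskspace._mnt.byte_avail.value",
     "datapoints": [[40 * GIB, 1410980000], [None, 1410980060]]},
    {"target": "integration.slave-b.diskspace._mnt.byte_avail.value",
     "datapoints": [[10 * GIB, 1410980000]]},
    {"target": "integration.slave-c.diskspace._mnt.byte_avail.value",
     "datapoints": [[0.5 * GIB, 1410980000]]},
])

for target, state in check_series(sample, warn=16 * GIB, crit=1 * GIB).items():
    print(state, target)
```

Keeping the series separate means the alert text can name the host that is low on space, which a single aggregated value could not do.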
[19:09:55] nope [19:10:00] ssds are also raid0 [19:10:01] ah no nevermind, raid1 is software [19:10:05] (03PS1) 10Aaron Schulz: Set bloom cache config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161021 [19:10:25] (03PS44) 1001tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - 10https://gerrit.wikimedia.org/r/155753 [19:10:29] (03PS1) 10Manybubbles: Increase weight of author namespace in wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161022 (https://bugzilla.wikimedia.org/69771) [19:11:03] Krinkle: yeah, I added that last week [19:11:35] robh: MegaCli -CfgEachDskRaid0 [WT|WB] [NORA|RA] [Direct|Cached] [-enblPI] [19:12:16] (yes, the gift that keeps on giving) [19:12:25] cmjohnson1: so i dont see any bootable disk option in either version of raid bios screen on ms-be1001 [19:12:25] so i dont think thats the issue, dunno yet [19:12:40] yeah..it's turned off...it will only work if we create a vd [19:12:58] (03PS45) 1001tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - 10https://gerrit.wikimedia.org/r/155753 [19:13:00] cmjohnson1: what do you mean? [19:13:08] whats turned off? [19:13:20] godog: yea i just have to get into the megacli prompt [19:13:25] the option to select a bootable disk is grayed out [19:13:29] which is something else during post, not ctrl+r [19:14:20] cmjohnson1: so it comes by all garbled on serial [19:14:34] can you reboot ms-be1001 and see what exactly the key combination is to enter megacli command line? [19:14:42] (or if you arent onsite it can wait and i'll google ;) [19:15:29] or there may not be one on this gen [19:15:33] i may be misrecalling. 
[19:16:35] now we're talking, http://wiki.hetzner.de/index.php/LSI_RAID_Controller/en [19:16:39] "JBOD" section [19:17:26] (03PS1) 10Yuvipanda: labmon: Spam the QA channel about integration project issues [puppet] - 10https://gerrit.wikimedia.org/r/161024 [19:18:51] godog: so i think the issue is i dont have an os [19:19:00] and i cannot load a raid command line like on some other cards [19:19:04] so i cannot do those easy commands =P [19:19:32] well you get an os with debian-installer from pxe [19:20:28] not with megacli support ;] [19:20:48] could then install it all though i suppose cuz i'd have networking [19:21:00] yes you get a shell [19:21:42] debian-installer with megacli support would be a fun project though if there are no udebs already [19:23:28] well [19:23:31] i dunno what it was [19:23:40] but manually redoing ms-be2001 to match ms-be1001 seems to be working [19:23:45] cmjohnson1: godog ^ [19:24:14] robh: cool! [19:24:43] \o/ [19:24:57] godog: i'm going to leave ms-be1001 depooled and offline for now though incase i need to refrence [19:25:04] i'll push back online within an hour though [19:25:16] once i have a couple of successfully installs online [19:26:11] robh: yeah an hour is fine [19:30:11] PROBLEM - swift eqiad-prod container availability on tungsten is CRITICAL: CRITICAL: 10.71% of data under the critical threshold [96.0] [19:31:13] yeah that's going to need some tuning, it isn't supposed to fire when a single machine goes down [19:32:07] (03PS11) 10Yuvipanda: [WIP] Initial shinken setup for labs [puppet] - 10https://gerrit.wikimedia.org/r/160626 [19:34:26] (03PS12) 10Yuvipanda: [WIP] Initial shinken setup for labs [puppet] - 10https://gerrit.wikimedia.org/r/160626 [19:36:05] robh: are also other machines in the same state as ms-be1001 with unusable megacli? 
I tried on ms-be1013 but the arrays all show up [19:36:35] godog: dunno, im done in 1001 so rebooting it now [19:36:42] to confirm it has that problem and im not just confused [19:37:10] my ms-be2001 install finished fine, but the boot order was fubar so it went back into it [19:38:25] * robh watches it with paranoia [19:38:32] anytime you reboot a machine it may not come back ;P [19:39:32] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 2.41 ms [19:39:33] haha let's hope not but yeah it sucks when it happens [19:39:34] and the installer killed the mbr, so reinstalling [19:39:40] killed ms-be2001 [19:39:43] ms-be1001 is good [19:39:48] and back online [19:40:10] cool [19:40:21] (03PS13) 10Yuvipanda: [WIP] Initial shinken setup for labs [puppet] - 10https://gerrit.wikimedia.org/r/160626 [19:40:42] (03PS1) 10Manybubbles: Push Cirrus' non-content enwiki shards apart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161031 [19:41:00] o/, opsen. Does one of you know how cpu measures work? Labs monitor provides cpu.total: guest, guest_nice, idle, iowait, irq, nice, softirq, steal, system and user. I have no idea how they relate to each other (which is negative, which is positive, which is part of another). If I want the total usage (0-100%) which do I use or add up? [19:41:06] but yeah as long as the SSD show up in the raid1 I think we're fine [19:41:10] Does 100% - idle cover everything? [19:41:27] does 100% - system - user - nice - iowait = idle? [19:45:52] Krinkle: an extra wrinkle in those measurements is CPUs that constantly adjust their MHz. 50% of 900MHz and 90% of 2.2GHz often just show as 50% or 90% with no qualifier. Don't have a good solution :( [19:50:58] bah, now it seems ms-be2002 is in reboot loop and not getting boot on disk ... [19:51:02] 2001 even [19:55:17] godog: so yea, trying megacli commands on ms-be1001 doesnt work [19:55:20] and i have no idea why [19:56:00] oh god damn it [19:56:07] damn sudo [19:56:09] nevermind.
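[Editor's note] On Krinkle's CPU question above: assuming the Labs metrics mirror the per-state counters in Linux's /proc/stat (an assumption, the collector is not named in the log), the states user, nice, system, idle, iowait, irq, softirq and steal add up to the whole, while guest and guest_nice are already counted inside user and nice. So "100% - idle" is a reasonable total, with the caveat that it treats iowait as busy time. The sketch below works that out from two samples of raw counters; the sample numbers are made up and the function names are hypothetical.

```python
# Field order of the aggregate "cpu" line in Linux /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")

# guest and guest_nice are already included in user and nice,
# so only the first eight fields go into the total.
ADDITIVE = FIELDS[:8]


def parse_cpu_line(line):
    """Turn a 'cpu ...' line into a dict of counters (older kernels have fewer fields)."""
    parts = line.split()
    return dict(zip(FIELDS, map(int, parts[1:])))


def usage_percent(before, after, count_iowait_as_busy=True):
    """Percentage of non-idle time between two counter samples."""
    delta = {name: after[name] - before[name] for name in before}
    total = sum(delta[name] for name in ADDITIVE if name in delta)
    if total == 0:
        return 0.0
    idle = delta["idle"]
    if not count_iowait_as_busy:
        idle += delta.get("iowait", 0)
    return 100.0 * (total - idle) / total


# Made-up samples taken some seconds apart.
t0 = parse_cpu_line("cpu  100 0 50 800 50 0 0 0 0 0")
t1 = parse_cpu_line("cpu  160 0 70 1500 70 0 0 0 10 0")

print(usage_percent(t0, t1))                              # iowait counted as busy
print(usage_percent(t0, t1, count_iowait_as_busy=False))  # iowait counted as idle
```

With these samples the deltas are user 60, system 20, idle 700 and iowait 20 out of 800 ticks, so the two definitions give 12.5% and 10.0%. Neither figure addresses the frequency scaling wrinkle raised in the reply.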
[19:56:47] i've adapted to no longer login as root, but not to the fact that i need to use sudo. [19:57:01] it takes time ... [19:57:15] use an alias :D [19:59:19] im very anti alias, if only cuz when i then go onto a system without them i'd be useless [20:00:25] I see your point [20:00:26] robh: hah! phew at least it works [20:01:02] ahh, my ms-be systems in codfw install, but arent booting to disks (when bios says to) [20:01:02] so something isnt getting the bootable flag [20:01:15] so one issue down, and a new issue arises! [20:03:11] (03CR) 10Andrew Bogott: [C: 032] labmon: Add contint monitoring for /mnt space as well [puppet] - 10https://gerrit.wikimedia.org/r/161019 (owner: 10Yuvipanda) [20:03:45] (03CR) 10Andrew Bogott: [C: 032] labmon: Spam the QA channel about integration project issues [puppet] - 10https://gerrit.wikimedia.org/r/161024 (owner: 10Yuvipanda) [20:04:47] robh: I checked ms-be1013 and just one out of two ssd has the bootable flag (sdn, the unbootable is sdm) [20:05:20] ok [20:05:27] so i just have to figure out how the hell to set it [20:05:33] makes sense, i just didnt see the option in raid bios [20:06:34] i can force them online or offline [20:06:41] godog: what is after sdz ? [20:07:30] matanya: sdaa perhaps? can't remember for sure [20:08:01] yes, found: http://askubuntu.com/questions/47447/block-devices-sda-sdb-sdc-what-comes-after-sdz [20:08:16] robh: we could hack it from debian-installer [20:08:36] by just copying down the megacli stuff ? [20:08:45] there has to be a proper way to do it though i'd think [20:09:25] well yeah we can provide a .udeb that debian-installer installs with megacli, and we can use it from the installer [20:10:04] how hard is it to implement that? 
[20:10:24] im checking the other raid bios screen option for bootable flag [20:10:35] (you can enter it via ctrol+r or dell bios, resulting is different UI) [20:11:17] bah doable, but way easier/faster to do to just download (from our repo) and run [20:13:47] d-i partman/early_command to be specific [20:14:20] RECOVERY - swift eqiad-prod container availability on tungsten is OK: OK: Less than 1.00% under the threshold [98.0] [20:14:50] (03CR) 10Ori.livneh: [C: 031] use scap's embedded linking, remove lint script [puppet] - 10https://gerrit.wikimedia.org/r/160691 (https://bugzilla.wikimedia.org/68255) (owner: 10Filippo Giunchedi) [20:17:01] PROBLEM - very high load average likely xfs on ms-be1006 is CRITICAL: CRITICAL - load average: 222.08, 107.65, 53.10 [20:18:24] * godog shakes fist at XFS [20:18:56] trusty's kernel better not to have that problem [20:19:41] !log rebooting ms-be1006 [20:19:48] Logged the message, Master [20:21:50] PROBLEM - SSH on ms-be1006 is CRITICAL: Connection refused [20:21:50] PROBLEM - swift-account-reaper on ms-be1006 is CRITICAL: Connection refused by host [20:22:01] PROBLEM - swift-object-replicator on ms-be1006 is CRITICAL: Connection refused by host [20:22:21] PROBLEM - swift-account-replicator on ms-be1006 is CRITICAL: Connection refused by host [20:26:58] (03PS14) 10Yuvipanda: [WIP] Initial shinken setup for labs [puppet] - 10https://gerrit.wikimedia.org/r/160626 [20:30:05] RECOVERY - very high load average likely xfs on ms-be1006 is OK: OK - load average: 2.18, 0.46, 0.15 [20:30:05] RECOVERY - swift-object-replicator on ms-be1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [20:30:21] RECOVERY - swift-account-replicator on ms-be1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [20:30:42] RECOVERY - SSH on ms-be1006 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [20:30:50] RECOVERY - swift-account-reaper on ms-be1006 is 
OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [20:32:36] (03PS1) 10RobH: star.wmfusercontent.org.pem update star.wmfusercontent.org.pem update [puppet] - 10https://gerrit.wikimedia.org/r/161049 [20:34:06] (03CR) 10RobH: [C: 032] "yes, all the issues due to missing - character." [puppet] - 10https://gerrit.wikimedia.org/r/161049 (owner: 10RobH) [20:34:37] mutante: ^ its live [20:35:13] robh: thx! checking [20:35:25] stupid paste issues are awesome. [20:35:31] indeed [20:35:40] one char can make such a difference :p [20:35:54] yay. thanks. [20:36:05] ok, so now i return to why the hell are my ms-be codfw hosts not getting bootable flags on ssd [20:36:08] and how to fix. [20:36:14] cuz i dunno =P [20:36:21] oh, wait [20:36:24] .... found it. [20:36:37] god damn it. [20:36:49] im happy but pissed it took me this long [20:37:00] robh: You've said god damn it quite a lot today :p [20:37:16] its been one of those days =P [20:37:41] Yesterday was awesome, everything went smoothly [20:37:52] because i ignored all of the problems until i had to address them today ;D [20:38:52] mutante: let me know if that works once its live =] [20:39:14] robh: yep, on it [20:39:21] figured you were, thanks [20:39:57] yay, ms-be2001 is installed! [20:39:59] robh: look on the bright side - codfw will be awesome because of you :D [20:40:05] godog: ^ so this means I can make the rest of them work as well [20:40:12] JohnFLewis: thx =] [20:41:32] robh: ah that's sweet, so the bootable issue was fixed too? [20:42:23] robh: so the puppet runs on cp1043/1044 are fine now [20:42:30] mutante: awesome! [20:43:22] the second site for it being enabled in nginx though.. 
is another thing [20:43:46] definitely progress, thx for the fix as well [20:48:50] (03CR) 10Dzahn: "yep, confirmed, before "can't load certificate" error, now just fine" [puppet] - 10https://gerrit.wikimedia.org/r/161049 (owner: 10RobH) [20:51:22] (03CR) 10Andrew Bogott: [C: 032] Virt1000 is no longer a redis slave of virt0. [puppet] - 10https://gerrit.wikimedia.org/r/161012 (owner: 10Andrew Bogott) [20:53:55] gwicke, cscott, subbu, are you still deploying? [20:54:12] nope, not today. [20:54:39] greg-g, can i push something https://gerrit.wikimedia.org/r/#/c/161087/ [20:55:51] yurikR1: it didn't break beta, right? :) [20:56:11] zeroportal [20:56:11] greg-g, its zeroportal only, shouldn't :) [20:56:30] :) [20:56:33] yurikR1: sure, doit [20:57:02] urghhh [20:57:06] so many ms-be hsots are H310s [20:57:08] so many! [20:57:09] greg-g, btw, i think we should have one section per day for the https://wikitech.wikimedia.org/wiki/Deployments [20:57:23] makes editing much easier [20:58:47] yurikR1: agree, will need a change to the lua module that generateds it [20:58:54] (03PS3) 10Andrew Bogott: Tell palladium about all the domains [puppet] - 10https://gerrit.wikimedia.org/r/160848 [20:59:15] greg-g, you generate it with lua? what updates the wiki? [21:00:13] (03CR) 10Andrew Bogott: [C: 032] Tell palladium about all the domains [puppet] - 10https://gerrit.wikimedia.org/r/160848 (owner: 10Andrew Bogott) [21:02:11] yurikR1: people do, but the formatting is done by lua [21:05:48] (03PS5) 10Andrew Bogott: Turn on dns and ldap on labcontrol2001. [puppet] - 10https://gerrit.wikimedia.org/r/161013 [21:07:52] (03CR) 10Andrew Bogott: [C: 032] Turn on dns and ldap on labcontrol2001. 
[puppet] - 10https://gerrit.wikimedia.org/r/161013 (owner: 10Andrew Bogott) [21:07:58] !log yurik Synchronized php-1.24wmf21/extensions/ZeroPortal/: (no message) (duration: 01m 05s) [21:08:04] Logged the message, Master [21:09:56] (03CR) 10Dzahn: "it means we have to keep a separate SSL cert, on a so called "misc" box. or the one for the monitoring host must include all the needed SA" [puppet] - 10https://gerrit.wikimedia.org/r/160823 (owner: 10Dzahn) [21:11:31] !log restarting rebuilding cirrus's enwiki index now that I've found the reason it wasn't working before - the new index was putting too many shards on an already full node and overwhelming it. silly allocation algorithm! thats a bad idea! [21:11:37] Logged the message, Master [21:12:41] (03CR) 10Dzahn: [C: 031] wikimedia.org: clarify labsconsole CNAME [dns] - 10https://gerrit.wikimedia.org/r/160454 (owner: 10Filippo Giunchedi) [21:13:41] (03CR) 10Andrew Bogott: [C: 031] "Good point Daniel, this is better." [dns] - 10https://gerrit.wikimedia.org/r/160454 (owner: 10Filippo Giunchedi) [21:13:49] (03CR) 10Dzahn: "one more attempt after fenari is really dead :)" [puppet] - 10https://gerrit.wikimedia.org/r/96424 (owner: 10Dzahn) [21:16:00] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [21:19:18] PROBLEM - puppet last run on virt0 is CRITICAL: CRITICAL: Epic puppet fail [21:21:24] (03PS1) 10Andrew Bogott: Don't rely on labcontrol2001 quite yet. [puppet] - 10https://gerrit.wikimedia.org/r/161092 [21:21:39] (03CR) 10jenkins-bot: [V: 04-1] Don't rely on labcontrol2001 quite yet. [puppet] - 10https://gerrit.wikimedia.org/r/161092 (owner: 10Andrew Bogott) [21:22:24] (03PS2) 10Andrew Bogott: Don't rely on labcontrol2001 quite yet. [puppet] - 10https://gerrit.wikimedia.org/r/161092 [21:23:12] (03CR) 10Andrew Bogott: [C: 032] Don't rely on labcontrol2001 quite yet. 
[puppet] - 10https://gerrit.wikimedia.org/r/161092 (owner: 10Andrew Bogott) [21:24:58] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 3.713 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.155.135 [21:26:18] RECOVERY - puppet last run on virt0 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [21:39:31] (03PS1) 10Andrew Bogott: Set up ldap servers for codfw [puppet] - 10https://gerrit.wikimedia.org/r/161094 [21:41:12] Gerrit down? [21:41:29] it's 503ing randomly for me [21:41:32] Hmm, service unavailable for a few seconds, then came back. [21:41:34] WFM if a little slow [21:41:35] wfm [21:41:43] !log fixing updates on planet feeds - file permissions [21:41:49] Logged the message, Master [21:42:38] PROBLEM - puppet last run on fenari is CRITICAL: CRITICAL: Puppet has 1 failures [21:44:22] (03CR) 10Andrew Bogott: [C: 032] Set up ldap servers for codfw [puppet] - 10https://gerrit.wikimedia.org/r/161094 (owner: 10Andrew Bogott) [21:48:38] RECOVERY - puppet last run on fenari is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [21:50:19] Danny_B: do you know the people running wikimedia.cz? [21:54:49] (03PS1) 10Dzahn: planet (en) - remove chetblog feed (404) [puppet] - 10https://gerrit.wikimedia.org/r/161098 [22:10:42] (03PS1) 10RobH: setting ms-be2002 dhcp params [puppet] - 10https://gerrit.wikimedia.org/r/161103 [22:11:38] (03CR) 10RobH: [C: 032] setting ms-be2002 dhcp params [puppet] - 10https://gerrit.wikimedia.org/r/161103 (owner: 10RobH) [22:20:49] (03PS1) 10BBlack: use v6 SLAAC tokens for interface::tagged, optionally [puppet] - 10https://gerrit.wikimedia.org/r/161106 [22:20:59] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:21:11] screw you strontium, keep up with the times [22:22:34] (03CR) 10BBlack: [C: 032] "Only affects lvs200x for now, testing." 
[puppet] - 10https://gerrit.wikimedia.org/r/161106 (owner: 10BBlack) [22:23:00] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [22:26:20] (03PS1) 10BBlack: apparently no string interp for inline_template? [puppet] - 10https://gerrit.wikimedia.org/r/161110 [22:26:44] (03CR) 10BBlack: [C: 032 V: 032] apparently no string interp for inline_template? [puppet] - 10https://gerrit.wikimedia.org/r/161110 (owner: 10BBlack) [22:27:22] grrrr [22:27:25] puppet :p [22:31:41] (03PS1) 10BBlack: perhaps @var will work [puppet] - 10https://gerrit.wikimedia.org/r/161111 [22:31:56] (03CR) 10BBlack: [C: 032 V: 032] perhaps @var will work [puppet] - 10https://gerrit.wikimedia.org/r/161111 (owner: 10BBlack) [22:42:38] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Epic puppet fail [22:43:18] (03PS1) 10BBlack: use regsubst instead of inline_template [puppet] - 10https://gerrit.wikimedia.org/r/161118 [22:43:42] (03CR) 10BBlack: [C: 032 V: 032] use regsubst instead of inline_template [puppet] - 10https://gerrit.wikimedia.org/r/161118 (owner: 10BBlack) [22:44:15] Heads up, I think Multimedia is the only team with SWAT patches, so I'm gonna do it [22:44:29] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [22:45:20] (03CR) 10Dzahn: [C: 032] wikimedia.org: clarify labsconsole CNAME [dns] - 10https://gerrit.wikimedia.org/r/160454 (owner: 10Filippo Giunchedi) [22:45:40] (03PS1) 10BBlack: fix quoting in token stuff [puppet] - 10https://gerrit.wikimedia.org/r/161120 [22:46:07] (03CR) 10BBlack: [C: 032 V: 032] fix quoting in token stuff [puppet] - 10https://gerrit.wikimedia.org/r/161120 (owner: 10BBlack) [22:52:38] (03PS1) 10BBlack: fixup iface addrs for lvs200x [puppet] - 10https://gerrit.wikimedia.org/r/161123 [22:52:47] (03CR) 10BBlack: [C: 032 V: 032] fixup iface addrs for lvs200x [puppet] - 10https://gerrit.wikimedia.org/r/161123 (owner: 10BBlack) [22:53:39] PROBLEM 
- puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Epic puppet fail [23:00:05] RoanKattouw, ^d, marktraceur, MaxSem: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140917T2300). Please do the needful. [23:00:12] * MaxSem goes for it [23:00:18] No no no [23:00:27] MaxSem: It's only Multimedia, I'm doing it [23:00:35] awlrite [23:00:40] Read backscroll :P [23:00:56] MaxSem: Behave [23:01:08] Three submodule updates to do [23:01:09] WHAT I'M NOT ALLOWED TO BREAK MULTIMEDIA NOW? [23:01:20] MaxSem: No, that's my job damn it [23:01:52] tgr: Ready to verify stuff? [23:01:57] yep [23:02:08] Sweet. [23:06:05] Argh [23:06:12] Mistyped a git rebase command [23:06:17] Might take a sec to undo [23:06:55] marktraceur: I have another patch [23:06:57] marktraceur: Hey, there's an urgent patch. [23:06:59] Snap with RoanKattouw. [23:07:00] Argh [23:07:15] MaxSem merged it a few minutes ago but he ignored Timo's comments, so now I'm writing a follow-up [23:07:20] Let me fix my stupidity and I'll do that after the UW [23:07:29] And the two patches together need SWATing to wmf21. [23:07:34] marktraceur: Have fun. [23:07:37] Yeah yeah [23:08:21] Whew. [23:08:24] That was scary [23:08:25] marktraceur, can you swat something for me too? [23:08:27] Oookay [23:08:34] yurikR1, RoanKattouw, add your patches to the list [23:08:47] marktraceur: Roan should write his first. :-) [23:08:49] Also, consider this me wagging my finger at you for being slow. :P [23:10:39] Yeah, I'm sorryt [23:10:43] !log marktraceur Synchronized php-1.24wmf21/extensions/UploadWizard/: [SWAT] Fix EventLogging schema declarations for UploadWizard (duration: 00m 11s) [23:10:49] Logged the message, Master [23:10:50] No prblem [23:10:57] We thought we could go with the patch that was there, and Max merged it more or less on time, but then we found a problem [23:11:03] OK, yurikR1, is yours, like, in-the-next-five-minutes urgent? 
[23:11:21] marktraceur, its a graph ext - there is a minor security patch [23:11:29] OK, then we'll do that next. [23:11:39] Because the MMV patch is just fixing a UI issue [23:11:52] tgr: I think you can verify the UW patch now [23:11:56] marktraceur, adding it to the wiki - just push Graph ext master to 21 [23:12:04] (03CR) 10Dzahn: [C: 031] metrics: move from stat1001 to varnish [puppet] - 10https://gerrit.wikimedia.org/r/160926 (owner: 10Filippo Giunchedi) [23:13:02] (03CR) 10Dzahn: [C: 031] "just needs to be done after Change-Id: I571fb86aea7115eb12 and Change-Id: I950b170468152" [puppet] - 10https://gerrit.wikimedia.org/r/160927 (owner: 10Filippo Giunchedi) [23:13:13] yurikR1: I suppose you don't have a submodule update patch yet? :) [23:13:25] marktraceur, nope :) [23:13:27] Fun times [23:13:30] No problemo [23:13:31] sorry [23:13:44] It's fine, I haven't done one in a while [23:13:48] will have a tiny config patch to enable it on meta in a sec [23:13:49] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [23:17:17] (03CR) 10Dzahn: [C: 031] "looks ok, just that it could cause some temp. monitoring spam if the old rules are removed before the new ones are added" [puppet] - 10https://gerrit.wikimedia.org/r/160802 (owner: 10Ottomata) [23:17:18] marktraceur: doesn't seem to work [23:17:27] tgr: Hm [23:17:28] at least didn't break anything either [23:17:33] That is a start [23:17:34] (03PS1) 10Yurik: Enable graph extension on labswiki and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161125 [23:17:42] Maybe the fix didn't work? [23:17:51] marktraceur, ^^ [23:17:52] I mean, as opposed to it not taking root [23:17:58] yurikR1: K, add that to the wiki too [23:18:06] yep, doing that now [23:18:17] This has to be the most hodgepodge deployment I've ever done. [23:18:23] Exciting! But hodgepodge. [23:18:48] marktraceur, hmm, when are we migrating labswiki to 21? 
[23:18:56] Uh, I dunno. [23:19:02] I think mutante is the one dealing with wikitech [23:19:05] ok, better push graph to 20 too :) [23:19:11] marktraceur: probably [23:19:23] sorry :) [23:19:24] Well then. [23:19:27] i could do it [23:19:30] I'll try to write a working one for the next swat [23:19:40] yurikR1: If you want to do the submodule update for 20, that'd be super [23:19:47] ok [23:19:54] Apparently my update-mediawiki cronjob didn't work very well [23:20:56] marktraceur, did you do the 21 already or should i do it too? [23:20:57] yurikR1: I don't see any new changes for Extension:Graph in wmf21 [23:21:04] Which patch is being pushed? [23:21:13] (03CR) 10Dzahn: [C: 031] "this is cool and would fix https://rt.wikimedia.org/Ticket/Display.html?id=7438 .. as long as dsh group "apaches" is really equal to just" [puppet] - 10https://gerrit.wikimedia.org/r/160953 (owner: 10Alexandros Kosiaris) [23:21:36] Oh, you want it updated to master. [23:22:17] someone pinged me? [23:22:30] marktraceur, https://gerrit.wikimedia.org/r/#/c/160369/ [23:22:32] Uhh, I don't think so [23:22:44] OK [23:22:50] but yes, better update it to master. Do you want me to patch both? [23:22:58] (03CR) 10Dzahn: "this is cool and would fix https://rt.wikimedia.org/Ticket/Display.html?id=7438 .. as long as dsh group "apaches" is really equal to just" [puppet] - 10https://gerrit.wikimedia.org/r/160953 (owner: 10Alexandros Kosiaris) [23:23:07] Sure sure [23:23:11] the mystery of who pinged hoo. [23:23:16] I can do one of the MMV patches in the meantime [23:23:17] legoktm: hoom. [23:23:17] marktraceur: per https://commons.wikimedia.org/wiki/Special:Version , Upload Wizard is still at https://git.wikimedia.org/tree/mediawiki%2Fextensions%2FUploadWizard.git/5a1fafe0849227387ff606640d6a6519cc27b520 [23:23:18] legoktm++ [23:23:29] is it possible that page is lagging? [23:23:36] (03CR) 10Dzahn: "when doing this, it should also replace the setup in beta, which uses salt but different.. 
would be great if that become the same again in" [puppet] - 10https://gerrit.wikimedia.org/r/160953 (owner: 10Alexandros Kosiaris) [23:23:37] Yeah, S:V lags, pretty sure [23:24:04] bd808: ^ check that out [23:24:09] legoktm: My scrollback is too short to see it :D [23:24:28] well it's not in my scrollback [23:26:32] marktraceur, updated https://wikitech.wikimedia.org/wiki/Deployments#Wednesday.2C.C2.A0September.C2.A017 [23:26:36] patches are waiting [23:26:51] Cool thanks [23:26:55] meaning - not merged, need +2 [23:26:59] Right [23:27:09] I'm waiting for jenkins ATM [23:27:15] poke it :) [23:27:28] Eh, nothing for it [23:29:01] mutante: Heh. I don't remember what the beta version ended up looking like. [23:29:28] bd808: somehow using salt but not exactly like that :) [23:29:34] !log marktraceur Synchronized php-1.24wmf20/extensions/MultimediaViewer/: [SWAT] Fix reuse dropdown message weirdness (duration: 00m 08s) [23:29:40] Logged the message, Master [23:29:57] 21 next... [23:30:22] marktraceur, want to do one more tiny one? :) [23:30:33] yurikR1: Why not, seems like I'm going to be around a while anyway :) [23:30:42] hehe [23:30:45] sec [23:31:15] (03CR) 10BryanDavis: "The version of this I wrote for beta is at (03CR) 10Dzahn: "hi Nuria, you said on IRC all merges needed were done by ottomata. does this mean this can be abandoned? or is it still needed." 
[puppet] - 10https://gerrit.wikimedia.org/r/160679 (owner: 10Nuria) [23:33:14] (03Abandoned) 10Nuria: Moving /var/lib/wikimetrics to /srv/wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/160679 (owner: 10Nuria) [23:35:01] !log marktraceur Synchronized php-1.24wmf21/extensions/MultimediaViewer/: [SWAT] Fix reuse dropdown message weirdness (duration: 00m 07s) [23:35:07] Logged the message, Master [23:35:13] tgr: OK, dropdown patches synced, plz to be verifying [23:35:30] verified on wmf20 [23:36:04] Sweet [23:36:15] (next is yurikR1 btw) [23:36:32] yurikR1: I suppose you won't be able to test the extension updates until we turn on the extension... [23:36:49] marktraceur, its fine, not used anywhere yet :) [23:36:58] Cool. [23:38:00] marktraceur, last one - https://gerrit.wikimedia.org/r/161133 [23:38:04] I see it [23:38:11] Sounds like a party [23:38:20] thx :) [23:39:12] (03CR) 10Dzahn: "topic branch looks correct :) added _joe_" [puppet] - 10https://gerrit.wikimedia.org/r/159636 (owner: 10Hoo man) [23:39:46] :D [23:40:29] marktraceur: OK I've finally emerged victorious and gotten git to cherry-pick this for me. Could you deploy https://gerrit.wikimedia.org/r/#/c/161134 and its dependency please? [23:40:34] I'll retroactively add them to the calendar [23:40:37] marktraceur: verified on wmf21 as well [23:40:57] Sweet tgr, thanks [23:41:23] I declare tgr's patches done, now syncing Graph [23:41:56] !log marktraceur Synchronized php-1.24wmf20/extensions/Graph/: [SWAT] Update Graph to master (duration: 00m 07s) [23:42:00] yurikR1: No i18n updates, right? [23:42:00] Logged the message, Master [23:42:13] I can still scap. [23:42:26] marktraceur, none that i care particularly about ) [23:42:30] Sounds fun [23:42:40] !log marktraceur Synchronized php-1.24wmf21/extensions/Graph/: [SWAT] Update Graph to master (duration: 00m 08s) [23:42:46] Logged the message, Master [23:42:47] Next, yurikR1's config change. 
[23:42:53] yei ) [23:43:25] (03CR) 10MarkTraceur: [C: 032] Enable graph extension on labswiki and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161125 (owner: 10Yurik) [23:43:30] (03Merged) 10jenkins-bot: Enable graph extension on labswiki and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161125 (owner: 10Yurik) [23:43:37] thx ) [23:44:28] marktraceur: how about another round of fixing UW? :) [23:44:58] tgr: Why not, might as well be a two-hour SWAT [23:45:00] Christ. :P [23:45:21] Sorry Mark :| [23:46:23] Uh, wait. yurikR1, I'm enabling a new extension? [23:46:40] Pretty sure that's not totally kosher, but if you have a compelling bug with backstory I'll do it anyway [23:47:00] marktraceur, it has already been enabled for a while on other wikis, csteip has checked security [23:47:08] Hmmm. [23:47:12] I skeptic, but I do [23:47:13] hence meta & labs for ppl to play on [23:47:26] !log marktraceur Synchronized wmf-config/InitialiseSettings.php: [SWAT] Enable Graph on metawiki and labswiki (duration: 00m 10s) [23:47:29] Please to be testing thoroughly [23:47:31] Logged the message, Master [23:47:35] :) [23:47:37] always! [23:47:49] the ext is very simple - all the work is done in js [23:47:56] Cool beans. [23:48:06] (03CR) 10Dzahn: "i can't connect to that URL from f.e. analytics1003 yet, but rolematcher.py is installed there" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/160467 (owner: 10Giuseppe Lavagetto) [23:48:20] RoanKattouw: OK, let's do yours next. [23:48:26] I think I might have one patch left for yurikR1 [23:48:32] But that can be after [23:49:10] RoanKattouw: Was this cherry-picked manually? [23:49:32] (cc Krinkle) [23:49:38] Yes, very [23:49:41] OK [23:49:44] 's fine. 
[23:49:47] It conflicted to hell and back [23:49:53] We're gonna be real careful with this one :) [23:50:09] Because the patch in master moves a bunch of files, some of which had already been moved in wmf21, some of which had not been [23:50:16] Yeah, I can see why :) [23:50:24] RoanKattouw: I assume if I try to deploy only the first one, I will not be going to space today [23:50:37] Although right now those files are in a directory that's not exposed to the web so how much worse can it get [23:50:44] lol [23:50:49] Yeah please do deploy them together [23:51:21] Though as I said you weren't going to space yesterday either so... [23:51:35] Truth [23:51:44] Well, I mean, you don't know my life [23:51:50] But it just so happens you're right [23:51:52] (By which I mean, the worst that can happen is that something that was broken will stay broken) [23:51:57] True. [23:52:34] Yeah you live closer to decommissioned space installations than I do so I don't know what you might have going on [23:53:26] (03CR) 10Dzahn: [C: 031] "i think ottomata is right about having the history in git, since we might lose gerrit history. merge and delete? :)" [puppet] - 10https://gerrit.wikimedia.org/r/151095 (owner: 10Ottomata) [23:55:37] (03CR) 10Dzahn: "what Jan said above" [puppet] - 10https://gerrit.wikimedia.org/r/151523 (https://bugzilla.wikimedia.org/60238) (owner: 10Tim Landscheidt) [23:55:52] RoanKattouw: What do you think, just a sync-dir? Scap? [23:56:05] scap [23:56:08] Fun times [23:56:18] Not because we have i18n changes but because we're like moving crap between directories [23:56:24] yeah I figured [23:56:27] And I don't trust myself to produce an exhaustive list of what's affected [23:56:32] Well, cross yer fingers [23:56:36] Yeeaah [23:57:09] !log marktraceur Started scap: [SWAT] Move things out of assets/ and into resources/assets/ [23:57:15] Logged the message, Master [23:57:45] See 'man 7 undocumented' for help when manual pages are not available. 
[23:57:56] undocumented - No manpage for this program, utility or function.