[00:03:54] PROBLEM - puppet last run on mw1147 is CRITICAL: CRITICAL: Puppet has 1 failures [00:12:28] reedy@tin:/srv/mediawiki-staging/php-1.24wmf21$ mwscript extensions/WikimediaMaintenance/makeSizeDBLists.php --wiki=mediawikiwiki [00:12:28] DB connection error: Can't connect to MySQL server on '208.80.154.18' (4) (208.80.154.18) [00:12:29] damn wikitech [00:17:09] (03PS1) 10Reedy: Update size related dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160881 [00:17:46] heh, betawikiversity got smaller [00:18:21] (03CR) 10Reedy: [C: 032] Update size related dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160881 (owner: 10Reedy) [00:18:26] (03Merged) 10jenkins-bot: Update size related dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160881 (owner: 10Reedy) [00:18:47] !log reedy Synchronized database lists: (no message) (duration: 00m 15s) [00:18:53] Logged the message, Master [00:20:47] (03PS2) 10Reedy: Set $wgCategoryCollation to 'uca-hr' on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/147922 (https://bugzilla.wikimedia.org/67287) (owner: 10Bartosz Dziewoński) [00:20:50] (03PS2) 10Reedy: Set $wgCategoryCollation to 'xx-uca-et' on all Estonian-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154213 (https://bugzilla.wikimedia.org/54168) (owner: 10Bartosz Dziewoński) [00:20:54] (03PS2) 10Reedy: Set $wgCategoryCollation to 'uca-sk' on skwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/140580 (owner: 10Danny B.) 
[00:20:57] (03PS2) 10Reedy: Set $wgCategoryCollation to 'uca-fr' on frwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155241 (https://bugzilla.wikimedia.org/69782) (owner: 10Bartosz Dziewoński) [00:21:14] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [00:21:35] RECOVERY - Disk space on lanthanum is OK: DISK OK [00:24:23] (03CR) 10Reedy: [C: 032] Set $wgCategoryCollation to 'uca-sk' on skwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/140580 (owner: 10Danny B.) [00:24:35] (03Merged) 10jenkins-bot: Set $wgCategoryCollation to 'uca-sk' on skwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/140580 (owner: 10Danny B.) [00:25:01] :o [00:25:03] !log reedy Synchronized wmf-config/InitialiseSettings.php: skwiki collation (duration: 00m 15s) [00:25:10] Logged the message, Master [00:26:40] !log Running `mwscript updateCollation.php --wiki=skwiki --previous-collation=uppercase` in screen on tin [00:26:45] Logged the message, Master [00:28:26] looks to speed up... 
[00:29:43] 190000/744573 [00:30:23] whee [00:31:26] 260000 [00:39:58] 700000 [00:40:44] 15m30.899s [00:40:49] !log updateCollation on skwiki done [00:40:56] Logged the message, Master [00:41:07] (03PS3) 10Reedy: Set $wgCategoryCollation to 'uca-fr' on frwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155241 (https://bugzilla.wikimedia.org/69782) (owner: 10Bartosz Dziewoński) [00:41:12] (03CR) 10Reedy: [C: 032] Set $wgCategoryCollation to 'uca-fr' on frwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155241 (https://bugzilla.wikimedia.org/69782) (owner: 10Bartosz Dziewoński) [00:41:18] (03Merged) 10jenkins-bot: Set $wgCategoryCollation to 'uca-fr' on frwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155241 (https://bugzilla.wikimedia.org/69782) (owner: 10Bartosz Dziewoński) [00:42:26] !log reedy Synchronized wmf-config/InitialiseSettings.php: frwikiversity collation (duration: 00m 17s) [00:42:32] Logged the message, Master [00:42:53] !log running `mwscript updateCollation.php --wiki=frwikiversity --previous-collation=uppercase` in screen on tin [00:42:56] tiny wiki is tiny [00:42:59] Logged the message, Master [00:43:34] !log updateCollation on frwikiversity done [00:43:41] Logged the message, Master [00:44:26] (03CR) 10Reedy: [C: 04-1] "Needs rebasing. 
lol" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/147922 (https://bugzilla.wikimedia.org/67287) (owner: 10Bartosz Dziewoński) [00:44:46] (03PS3) 10Reedy: Set $wgCategoryCollation to 'xx-uca-et' on all Estonian-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154213 (https://bugzilla.wikimedia.org/54168) (owner: 10Bartosz Dziewoński) [00:44:50] (03CR) 10Reedy: [C: 032] Set $wgCategoryCollation to 'xx-uca-et' on all Estonian-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154213 (https://bugzilla.wikimedia.org/54168) (owner: 10Bartosz Dziewoński) [00:44:55] (03Merged) 10jenkins-bot: Set $wgCategoryCollation to 'xx-uca-et' on all Estonian-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154213 (https://bugzilla.wikimedia.org/54168) (owner: 10Bartosz Dziewoński) [00:45:26] !log reedy Synchronized wmf-config/InitialiseSettings.php: et collations (duration: 00m 15s) [00:45:33] Logged the message, Master [00:45:56] !log running `mwscript updateCollation.php --wiki=etwiki --previous-collation=uppercase` in screen on tin [00:46:02] Logged the message, Master [00:46:04] (03PS1) 10MZMcBride: Minor fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160890 [00:46:31] !log etwikibooks collation updated (280 rows) [00:46:37] Logged the message, Master [00:46:49] !log etwikimedia collation updated (121 rows) [00:46:54] Logged the message, Master [00:47:25] !log etwikiquote collation updated (706 rows) [00:47:31] Logged the message, Master [00:47:49] !log etwikisource collation updated (9918 rows) [00:47:55] Logged the message, Master [00:48:12] !log running `mwscript updateCollation.php --wiki=etwiktionary --previous-collation=uppercase` in screen on tin [00:48:18] Logged the message, Master [00:52:56] !log updateCollation on etwiktionary done [00:53:03] Logged the message, Master [00:53:07] !log updateCollation on etwiki done [00:53:13] Logged the message, Master [00:57:46] (03PS3) 10Reedy: Set 
$wgCategoryCollation to 'uca-hr' on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/147922 (https://bugzilla.wikimedia.org/67287) (owner: 10Bartosz Dziewoński) [00:58:03] (03CR) 10Reedy: [C: 032] Set $wgCategoryCollation to 'uca-hr' on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/147922 (https://bugzilla.wikimedia.org/67287) (owner: 10Bartosz Dziewoński) [00:58:08] (03Merged) 10jenkins-bot: Set $wgCategoryCollation to 'uca-hr' on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/147922 (https://bugzilla.wikimedia.org/67287) (owner: 10Bartosz Dziewoński) [00:58:42] !log reedy Synchronized wmf-config/InitialiseSettings.php: shwiki collation (duration: 00m 16s) [00:58:49] Logged the message, Master [00:59:16] !log running `mwscript updateCollation.php --wiki=shwiki --previous-collation=uppercase` in screen on tin [00:59:21] Logged the message, Master [01:15:34] !log updateCollation on shwiki done [01:15:42] Logged the message, Master [01:26:06] (03PS1) 10BBlack: fix ipv6 revdns for install2001 [dns] - 10https://gerrit.wikimedia.org/r/160895 [01:26:08] (03PS1) 10BBlack: add v6 dns for acamar+achernar [dns] - 10https://gerrit.wikimedia.org/r/160896 [01:26:20] (03CR) 10jenkins-bot: [V: 04-1] add v6 dns for acamar+achernar [dns] - 10https://gerrit.wikimedia.org/r/160896 (owner: 10BBlack) [01:26:27] (03CR) 10BBlack: [C: 032] fix ipv6 revdns for install2001 [dns] - 10https://gerrit.wikimedia.org/r/160895 (owner: 10BBlack) [01:27:15] (03PS1) 10Springle: depool s6 db1015 and s7 db1039 for codfw cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160897 [01:27:39] (03CR) 10Springle: [C: 032] depool s6 db1015 and s7 db1039 for codfw cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160897 (owner: 10Springle) [01:27:42] (03PS2) 10BBlack: add v6 dns for acamar+achernar [dns] - 10https://gerrit.wikimedia.org/r/160896 [01:27:45] (03Merged) 10jenkins-bot: depool s6 db1015 and s7 db1039 for codfw cloning [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/160897 (owner: 10Springle) [01:28:15] (03CR) 10BBlack: [C: 032] add v6 dns for acamar+achernar [dns] - 10https://gerrit.wikimedia.org/r/160896 (owner: 10BBlack) [01:29:02] !log springle Synchronized wmf-config/db-eqiad.php: depool s6 db1015 and s7 db1039 (duration: 00m 20s) [01:29:11] Logged the message, Master [01:33:14] !log xtrabackup clone db1015 to db2028 [01:33:22] Logged the message, Master [01:33:24] !log xtrabackup clone db1039 to db2029 [01:33:29] Logged the message, Master [01:35:09] PROBLEM - check_fundraising_jobs on db1025 is CRITICAL: CRITICAL missing_thank_yous=616 [critical =500]: recurring_gc_contribs_missed=0: recurring_gc_failures_missed=0: recurring_gc_jobs_required=959: recurring_gc_schedule_sanity=0 [01:40:11] RECOVERY - check_fundraising_jobs on db1025 is OK: OK missing_thank_yous=0: recurring_gc_contribs_missed=0: recurring_gc_failures_missed=0: recurring_gc_jobs_required=959: recurring_gc_schedule_sanity=0 [01:40:48] (03PS1) 10Springle: assign codfw slaves: x1 db2009, m1 db2010, m2 db2011, m3 db2012 [puppet] - 10https://gerrit.wikimedia.org/r/160898 [01:43:44] (03CR) 10Springle: [C: 032] assign codfw slaves: x1 db2009, m1 db2010, m2 db2011, m3 db2012 [puppet] - 10https://gerrit.wikimedia.org/r/160898 (owner: 10Springle) [01:49:12] PROBLEM - puppet last run on db2012 is CRITICAL: CRITICAL: Puppet has 3 failures [01:54:39] !log xtrabackup clone db1031 to db2009 [01:54:44] Logged the message, Master [02:00:25] !log xtrabackup clone db1016 to db2010 [02:00:31] Logged the message, Master [02:08:02] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Puppet has 1 failures [02:11:12] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3613 MB (3% inode=99%): [02:15:20] !log xtrabackup clone db1046 to db2011 [02:15:27] Logged the message, Master [02:17:53] RECOVERY - puppet last run on db2012 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [02:21:27] 
!log xtrabackup clone db1048 to db2012 [02:21:33] Logged the message, Master [02:26:32] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Puppet has 1 failures [02:28:22] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [02:43:02] !log LocalisationUpdate completed (1.24wmf20) at 2014-09-17 02:43:02+00:00 [02:43:09] Logged the message, Master [02:43:54] PROBLEM - puppet last run on mw1067 is CRITICAL: CRITICAL: Puppet has 1 failures [02:45:52] RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [02:58:39] (03CR) 10Chmarkine: [C: 031] tendril.wm.org - move behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/160823 (owner: 10Dzahn) [03:01:02] RECOVERY - Disk space on virt0 is OK: DISK OK [03:02:53] (03PS1) 10Springle: repool db1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160906 [03:03:12] RECOVERY - puppet last run on mw1067 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [03:03:44] (03CR) 10Springle: [C: 032] repool db1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160906 (owner: 10Springle) [03:03:48] (03Merged) 10jenkins-bot: repool db1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160906 (owner: 10Springle) [03:07:00] !log springle Synchronized wmf-config/db-eqiad.php: repool s6 db1015 (duration: 01m 41s) [03:07:06] Logged the message, Master [03:11:02] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: Puppet has 1 failures [03:17:38] !log LocalisationUpdate completed (1.24wmf21) at 2014-09-17 03:17:38+00:00 [03:17:44] Logged the message, Master [03:29:14] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [04:30:52] (03PS2) 10Tim Starling: Remove bits.wikimedia.org/robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154234 [04:31:27] (03CR) 10Tim Starling: [C: 
032] "PS2: rebase" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154234 (owner: 10Tim Starling) [04:31:31] (03Merged) 10jenkins-bot: Remove bits.wikimedia.org/robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154234 (owner: 10Tim Starling) [04:32:17] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Sep 17 04:32:17 UTC 2014 (duration 32m 16s) [04:32:24] Logged the message, Master [04:34:11] !log tstarling Synchronized docroot/bits: (no message) (duration: 00m 10s) [04:34:17] Logged the message, Master [04:46:13] RECOVERY - CI tmpfs disk space on lanthanum is OK: DISK OK [04:56:24] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Epic puppet fail [05:14:44] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [05:47:50] (03PS3) 10Giuseppe Lavagetto: icinga - use apache::site [puppet] - 10https://gerrit.wikimedia.org/r/160820 (owner: 10Dzahn) [05:49:04] (03CR) 10Giuseppe Lavagetto: [C: 032] icinga - use apache::site [puppet] - 10https://gerrit.wikimedia.org/r/160820 (owner: 10Dzahn) [06:05:59] akosiaris: ping [06:06:26] <_joe_> cajoel: pings at this time in the morning are ok only if coming with coffee [06:06:34] <_joe_> :) [06:06:38] it's 11pm [06:06:41] where I'm sitting [06:06:46] so I need a pillow [06:06:51] <_joe_> cajoel: 8 AM here [06:07:07] <_joe_> I'd use a pillow as well [06:07:14] maybe a snuggie [06:07:30] alex had a gerrit patch set for puppet + openldap [06:07:32] I can't find it [06:08:41] and he's too prolific in gerrit to make it easy [06:08:50] <_joe_> https://gerrit.wikimedia.org/r/#/c/156322/ ? 
[06:09:09] ding [06:09:10] thanks [06:09:21] apparently I'm no good at gerrit search [06:09:34] <_joe_> it's a skill you develop to survive :P [06:09:46] <_joe_> I usually use owner: status: project: [06:11:41] (03PS2) 10Giuseppe Lavagetto: mediawiki: make HAT appservers a separate cluster in ganglia [puppet] - 10https://gerrit.wikimedia.org/r/160624 [06:16:55] <_joe_> cajoel: btw, my thunderbird heuristic scam detection goes bananas with all the openvas related emails :P [06:17:08] heh [06:17:19] HTML in zip file!! oh no! [06:23:16] Q: when hacking local puppet apply, how do I specify where to pick up a template file? [06:24:01] <_joe_> cajoel: you can specify the puppetdir, templates will be searched in $puppetdir/templates, and in $modulepath/templates [06:24:18] <_joe_> for the specific CLI switches, use the man luke! (as I don't remember) [06:24:28] <_joe_> pretty sure the latter is --modulepath, but check it [06:25:22] (03PS3) 10Giuseppe Lavagetto: mediawiki: make HAT appservers a separate cluster in ganglia [puppet] - 10https://gerrit.wikimedia.org/r/160624 [06:25:43] (03CR) 10Giuseppe Lavagetto: [C: 031] "http://puppet-compiler.wmflabs.org//350/change/160624/html" [puppet] - 10https://gerrit.wikimedia.org/r/160624 (owner: 10Giuseppe Lavagetto) [06:26:27] (03PS1) 10Springle: repool db1039 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160918 [06:26:48] (03CR) 10Springle: [C: 032] repool db1039 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160918 (owner: 10Springle) [06:26:50] --templatedir [06:26:56] (03Merged) 10jenkins-bot: repool db1039 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160918 (owner: 10Springle) [06:26:58] thx [06:27:23] (03CR) 10Alexandros Kosiaris: "Curious as to why. nickel does not initiate connections to machines, does it ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/160802 (owner: 10Ottomata) [06:27:30] !log springle Synchronized wmf-config/db-eqiad.php: repool s7 db1039 (duration: 00m 08s) [06:27:35] Logged the message, Master [06:27:42] PROBLEM - puppetmaster backend https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:28:22] <_joe_> that's bogus [06:28:22] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: Epic puppet fail [06:28:24] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: Epic puppet fail [06:28:32] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Epic puppet fail [06:28:32] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: Epic puppet fail [06:28:33] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.797 second response time [06:28:36] <_joe_> or at least, it's ok no [06:28:44] PROBLEM - puppet last run on search1007 is CRITICAL: CRITICAL: Epic puppet fail [06:28:47] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 2 failures [06:28:54] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Epic puppet fail [06:28:57] <_joe_> oh it's mod_passenger time [06:29:06] PROBLEM - puppet last run on mw1126 is CRITICAL: CRITICAL: Epic puppet fail [06:29:07] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Epic puppet fail [06:29:21] (03CR) 10Jkrauska: "Ordering problem...?" 
[puppet] - https://gerrit.wikimedia.org/r/156322 (owner: Alexandros Kosiaris) [06:29:36] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:43] cajoel: gimme a sec, uploading a new way better change [06:29:59] * akosiaris_ 3rd day that my bouncer box is down :-( [06:30:06] akosiaris: found a minor ordering dependency [06:30:17] got one too [06:30:27] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:27] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:37] dpkg won't install slapd when it finds a slapd.conf [06:30:40] ok [06:30:44] <_joe_> akosiaris_: I can lend you one [06:30:46] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:47] PROBLEM - puppet last run on dbproxy1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:47] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:06] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:06] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:06] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:06] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:16] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:16] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:17] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:17] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:26] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:26] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:26] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: Puppet has 1 failures 
[06:31:27] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:27] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:27] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:27] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:31] <_joe_> mh [06:31:37] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:17] PROBLEM - puppet last run on db1004 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:39] (CR) Jkrauska: "require => Package['slapd']," [puppet] - https://gerrit.wikimedia.org/r/156322 (owner: Alexandros Kosiaris) [06:32:53] gotta go put kid back to sleep [06:33:04] akosiaris: email me with details? signing off [06:33:56] cajoel: ok. have a nice sleep [06:33:57] (PS2) Alexandros Kosiaris: WIP: openldap module [puppet] - https://gerrit.wikimedia.org/r/156322 [06:34:05] cajoel: ^ there you go [06:34:21] might be back--depends on the kiddo [06:35:12] schema aren't part of a package? [06:35:42] yeah but not gonna install samba just to get a schema file [06:36:13] <_joe_> definitely not [06:36:23] is there a simple (tar this gerrit up and download it) link? 
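Note on the ordering problem raised above ("dpkg won't install slapd when it finds a slapd.conf", answered with `require => Package['slapd'],`): a minimal sketch of what such a dependency could look like in a Puppet manifest. The file path, the template name and the resource layout are assumptions made for illustration. They are not taken from the openldap change under review.

```puppet
# Hypothetical sketch, not the contents of the openldap module under review.
package { 'slapd':
  ensure => installed,
}

# Assumed path and template name, for illustration only.
file { '/etc/ldap/slapd.conf':
  ensure  => file,
  content => template('openldap/slapd.conf.erb'),
  # Apply this file only after the slapd package resource, so the
  # config is not already on disk when dpkg installs the package.
  require => Package['slapd'],
}
```

With `require` in place, Puppet orders the file resource after the package resource instead of leaving the order unspecified.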
[06:36:27] I've found per file downloads [06:36:29] crap [06:36:33] baby monitor wins [06:36:34] ttyl [06:36:53] signing out too, going to the gym [06:37:16] <_joe_> bye to both of you [06:45:06] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:45:47] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:45:48] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:45:48] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:46:06] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:46:07] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:46:16] RECOVERY - puppet last run on dbproxy1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:46:16] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:46:16] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:46:17] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:46:17] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:46:17] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:46:17] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:46:26] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures 
[06:46:27] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:46:27] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:46:36] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:46:46] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:46:47] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:46:47] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:46:47] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:46:48] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:46:49] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:46:50] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/ubuntu/project/trace/carbon.wikimedia.org is over 12 hours old. 
[06:47:06] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:47:07] RECOVERY - puppet last run on search1007 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:47:16] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:47:26] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:47:26] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:47:47] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:47:48] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/ubuntu/project/trace/carbon.wikimedia.org is over 0 hours old. [06:47:56] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:48:06] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [06:49:36] RECOVERY - puppet last run on db1004 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:51:36] PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: Puppet has 2 failures [06:51:47] (03PS1) 10Springle: depool s1 db1061 for codfw cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160921 [06:52:05] (03CR) 10Springle: [C: 032] depool s1 db1061 for codfw cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160921 (owner: 10Springle) [06:52:10] (03Merged) 10jenkins-bot: depool s1 db1061 for codfw cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160921 (owner: 10Springle) [06:52:46] !log springle Synchronized wmf-config/db-eqiad.php: depool s1 db1061 for codfw cloning (duration: 00m 07s) [06:52:52] Logged the message, Master [06:52:52] and 
we're back [06:53:10] and alex goes to the gym [06:55:35] !log xtrabackup clone db1061 to db2016 [06:55:41] Logged the message, Master [06:58:33] where are gerrit credentials pulled from? [06:59:21] seems I cannot login to gerrit... [07:08:17] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [07:50:38] (03CR) 10Filippo Giunchedi: [C: 031] tendril.wm.org - move behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/160823 (owner: 10Dzahn) [07:57:10] (03CR) 10Springle: [C: 031] tendril.wm.org - move behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/160823 (owner: 10Dzahn) [07:58:44] <_joe_> mmmh are you sure moving a monitoring host behind a layer of indirection is a good idea? [07:58:47] <_joe_> I don't [08:00:00] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "I think putting any monitoring service behind an indirection layer is a poor decision: availablity of monitoring should only be influenced" [puppet] - 10https://gerrit.wikimedia.org/r/160823 (owner: 10Dzahn) [08:08:32] (03PS2) 10Filippo Giunchedi: wikimedia.org: clarify labsconsole CNAME [dns] - 10https://gerrit.wikimedia.org/r/160454 [08:09:25] (03CR) 10Filippo Giunchedi: "no real fixes but cleanup, I don't have a strong opinion either way, what about the last PS?" 
[dns] - 10https://gerrit.wikimedia.org/r/160454 (owner: 10Filippo Giunchedi) [08:30:30] (03PS1) 10Giuseppe Lavagetto: puppet: introduce hiera for production [puppet] - 10https://gerrit.wikimedia.org/r/160924 [08:37:30] (03PS1) 10Filippo Giunchedi: metrics: point to misc-web-lb.eqiad [dns] - 10https://gerrit.wikimedia.org/r/160925 [08:43:02] (03PS1) 10Filippo Giunchedi: metrics: move from stat1001 to varnish [puppet] - 10https://gerrit.wikimedia.org/r/160926 [08:43:04] (03PS1) 10Filippo Giunchedi: metrics: disable SSL virtualhost and cert [puppet] - 10https://gerrit.wikimedia.org/r/160927 [08:45:22] (03Abandoned) 10Filippo Giunchedi: move metrics.wm.o and metrics-api.wm.o behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/160419 (owner: 10Filippo Giunchedi) [08:45:42] PROBLEM - puppet last run on fenari is CRITICAL: CRITICAL: Puppet has 1 failures [08:45:44] moaaaar misc-web ! [08:46:46] <_joe_> mmh I'm not a fan honestly [08:46:58] <_joe_> but metrics is a good fit (maybe) [08:52:48] yeah it is really a redirect [08:52:54] why not a fan btw _joe_ ? 
[08:53:38] <_joe_> godog: for monitoring, I feel the less moving parts, the better [08:53:45] <_joe_> but metrics is not really monitoring [08:53:59] <_joe_> so it's a good fit, maybe [08:54:23] <_joe_> (hence my -2 to moving tendril behind misc-web) [08:56:00] (03PS2) 10Alexandros Kosiaris: Removal of all snmptrap functionality [puppet] - 10https://gerrit.wikimedia.org/r/159286 [08:56:20] <_joe_> \o/ [08:56:25] yep for monitoring I tend to agree, even though not being able to access the interface doesn't impact functionality [08:56:37] (03PS9) 10Alexandros Kosiaris: Introducing Service Cluster A, hosting mathoid [puppet] - 10https://gerrit.wikimedia.org/r/156576 (https://bugzilla.wikimedia.org/69990) (owner: 10Physikerwelt) [08:56:55] <_joe_> +10 why is that not merged on smptrap [08:57:14] (03CR) 10Alexandros Kosiaris: [C: 032] Removal of all snmptrap functionality [puppet] - 10https://gerrit.wikimedia.org/r/159286 (owner: 10Alexandros Kosiaris) [08:57:18] (03CR) 10Giuseppe Lavagetto: "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/159286 (owner: 10Alexandros Kosiaris) [08:57:31] (03CR) 10Alexandros Kosiaris: [C: 032] Introducing Service Cluster A, hosting mathoid [puppet] - 10https://gerrit.wikimedia.org/r/156576 (https://bugzilla.wikimedia.org/69990) (owner: 10Physikerwelt) [09:01:01] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Epic puppet fail [09:01:38] (03PS1) 10Alexandros Kosiaris: mathoid ganglia cluster renamed to sca [puppet] - 10https://gerrit.wikimedia.org/r/160932 [09:01:52] PROBLEM - puppet last run on sca1002 is CRITICAL: CRITICAL: Epic puppet fail [09:02:52] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 5 failures [09:02:55] (03CR) 10Alexandros Kosiaris: [C: 032] mathoid ganglia cluster renamed to sca [puppet] - 10https://gerrit.wikimedia.org/r/160932 (owner: 10Alexandros Kosiaris) [09:04:48] <_joe_> sca1001? 
[09:04:51] <_joe_> this is new [09:05:11] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:06:29] _joe_: yeah. Service Cluster A [09:06:34] super obvious :D [09:07:17] so if we ever move a host to service cluster b, we change the hostname? [09:07:52] I would have named them svc#### but hey that is just me [09:07:58] <_joe_> "scab" [09:08:18] can't wait for cluster P [09:10:03] mmhh ldaplist -l group wmf on sanger yields "password incorrect" (both as my user and as root), known issue? [09:10:49] that is, without prompting for a password which makes sense running from root [09:13:22] PROBLEM - SSH on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:14:41] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:15:13] PROBLEM - nutcracker process on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:15:15] PROBLEM - check if dhclient is running on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:15:16] PROBLEM - Disk space on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:15:53] godog: sanger ? [09:15:59] why sanger ? [09:16:11] RECOVERY - nutcracker process on fenari is OK: PROCS OK: 1 process with UID = 116 (nutcracker), command name nutcracker [09:16:11] RECOVERY - check if dhclient is running on fenari is OK: PROCS OK: 0 processes with command name dhclient [09:16:11] RECOVERY - Disk space on fenari is OK: DISK OK [09:18:33] akosiaris: I was looking at https://wikitech.wikimedia.org/wiki/LDAP#LDAP_in_Production from https://wikitech.wikimedia.org/wiki/RT_Triage_Duty#LDAP_group_changes [09:19:33] RECOVERY - SSH on fenari is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [09:20:07] godog: sanger is the OIT LDAP mirror, not really meant to be modified [09:20:19] OIT LDAP mirror != labs LDAP [09:20:47] 2 different LDAPs... 
the OIT LDAP mirror is only used for cheap RCPT TO checks [09:22:28] akosiaris: ah ok, that makes more sense so virt1000 or virt0 I guess [09:22:55] PROBLEM - check configured eth on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:23:45] RECOVERY - check configured eth on fenari is OK: NRPE: Unable to read output [09:24:54] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [09:27:19] (CR) Springle: "Tendril is more about metrics and inventory than monitoring, but it's certainly a nebulous distinction." [puppet] - https://gerrit.wikimedia.org/r/160823 (owner: Dzahn) [09:27:55] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:27:56] _joe_: ^ not disagreeing with you ;) maybe this sort of thing needs discussion in general [09:28:38] <_joe_> that was basically my point [09:28:44] PROBLEM - DPKG on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:29:45] RECOVERY - DPKG on fenari is OK: All packages OK [09:30:04] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [09:38:55] PROBLEM - DPKG on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:41:15] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:42:14] PROBLEM - check configured eth on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:43:55] PROBLEM - nutcracker port on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:44:52] <_joe_> sigh, fenari [09:44:54] RECOVERY - nutcracker port on fenari is OK: TCP OK - 0.000 second response time on port 11212 [09:44:54] RECOVERY - DPKG on fenari is OK: All packages OK [09:45:03] yeah, probably again on swap [09:45:08] I am logging in now [09:45:14] RECOVERY - check configured eth on fenari is OK: NRPE: Unable to read output [09:47:29] (03PS1) 10Alexandros Kosiaris: mathoid: Remove duplicate resource and fix docs [puppet] - 10https://gerrit.wikimedia.org/r/160938 [09:47:54] PROBLEM - DPKG on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:59] (03CR) 10Alexandros Kosiaris: [C: 032] mathoid: Remove duplicate resource and fix docs [puppet] - 10https://gerrit.wikimedia.org/r/160938 (owner: 10Alexandros Kosiaris) [09:49:14] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [09:49:54] RECOVERY - DPKG on fenari is OK: All packages OK [09:49:59] (03PS1) 10Springle: repool db1061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160939 [09:50:55] poor fenari [09:51:13] seems like puppet + whatever that apache was doing did not treat him well [09:51:32] (03CR) 10Springle: [C: 032] repool db1061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160939 (owner: 10Springle) [09:51:36] (03Merged) 10jenkins-bot: repool db1061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160939 (owner: 10Springle) [09:52:14] !log springle Synchronized wmf-config/db-eqiad.php: repool s1 db1061 (duration: 00m 08s) [09:52:21] Logged the message, Master [09:52:56] RECOVERY - puppet last run on fenari is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [09:52:59] !log stopped apache2 on fenari, it was leaking memory, puppet restarted it, need to kill this machine ASAP [09:53:05] Logged the message, Master [09:53:14] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4775 bytes in 0.103 second response time [09:55:24] I love it that as soon we actively started moving 
stuff off fenari it fell over, throwing its toys out of the pram [09:55:57] in my $DAYJOB-1 we had a very very old solaris box that finally we were about to decommission [09:56:24] so there is an RT about doing it Monday morning [09:56:43] the box had 0 services, it was all about a last check and shutting it down [09:57:04] and Sunday night it decided to die [09:57:15] the rest of the day was all about harakiri jokes ... [09:57:52] hehehe [09:58:33] (03PS1) 10Alexandros Kosiaris: mathoid: create /srv/deployment [puppet] - 10https://gerrit.wikimedia.org/r/160940 [09:58:46] wonder if notpeter will want fenari too, to go with db9 [09:59:13] huh ? [09:59:30] did notpeter do anything with db9 ? can't remember [09:59:51] (03CR) 10Alexandros Kosiaris: [C: 032] mathoid: create /srv/deployment [puppet] - 10https://gerrit.wikimedia.org/r/160940 (owner: 10Alexandros Kosiaris) [10:03:54] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [10:04:05] yey! [10:04:36] RECOVERY - puppet last run on sca1002 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [10:04:37] I think I am going to ulimit apache2 on fenari [10:08:56] whee.. all core dbs replicated to codfw. just external storage to go and pmtpa is yesterday's news [10:09:01] at least for db data [10:09:17] sweet [10:11:20] akosiaris: i recall some discussion, maybe jokingly, about shipping db9 to notpeter. sentimentality :) [10:19:11] godog: thx for the new jenkins package. Will brea^H^H^H^Hupgrade it this afternoon [10:20:59] hashar: haha no worries, simple enough :) [10:22:41] commuting back home for lunch [10:24:33] matanya: there was this about the broken puppet compiler job, fixed it seems? 
https://rt.wikimedia.org/Ticket/Display.html?id=8051 [10:25:11] godog: iirc _joe_ fixed it [10:27:41] not sure what "broken" meant in the first place :) [10:37:55] <_joe_> godog: "not working" [10:40:06] _joe_: as in the jenkins job would fail entirely? [10:41:20] or it won't run at all for example? many shades (50?) of "not working" as usual [10:48:20] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:53:09] <_joe_> godog: yep [10:54:50] _joe_: which of the two options? :) [10:56:38] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:57:21] akosiaris: you broke netmon1001 badly [10:57:42] requiring the snmp package to be purged from all machines is perhaps not the best thing to do :) [10:57:52] hmmm [10:58:02] ok fixing [10:58:06] thank you :) [11:01:35] (03PS3) 10Alexandros Kosiaris: Remove the snmptt user [puppet] - 10https://gerrit.wikimedia.org/r/143305 [11:03:58] (03CR) 10Alexandros Kosiaris: [C: 032] Remove the snmptt user [puppet] - 10https://gerrit.wikimedia.org/r/143305 (owner: 10Alexandros Kosiaris) [11:24:24] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: Epic puppet fail [11:24:54] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:46] PROBLEM - puppet last run on db74 is CRITICAL: CRITICAL: Epic puppet fail [11:26:47] (03PS1) 10Alexandros Kosiaris: librenms requires snmp [puppet] - 10https://gerrit.wikimedia.org/r/160944 [11:27:13] mark: fixed manually, seems like the only thing that directly called snmpget/snmpwalk et al. Which is ew, at least all other tools use libsnmp or some other binding. 
Puppet fix in https://gerrit.wikimedia.org/r/160944, but don't merge yet (I want the rest of the cluster to purge snmp before that) [11:27:59] ok [11:28:14] * akosiaris remembers the require_packages discussion the other day [11:29:12] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Wait for https://gerrit.wikimedia.org/r/#/c/143306/ to be merged" [puppet] - 10https://gerrit.wikimedia.org/r/160944 (owner: 10Alexandros Kosiaris) [11:37:40] <_joe_> akosiaris: which part? [11:38:29] the avoiding multiple definitions and ensure => absent vs ensure => present [11:38:37] <_joe_> eheh [11:38:47] * _joe_ whistles [11:43:22] (03PS1) 10coren: Labs: Fix generation of meta_p.meta [software] - 10https://gerrit.wikimedia.org/r/160945 (https://bugzilla.wikimedia.org/54962) [11:43:34] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [11:44:05] RECOVERY - puppet last run on db74 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [11:44:54] (03CR) 10coren: [C: 032 V: 032] "(Reflects current version)" [software] - 10https://gerrit.wikimedia.org/r/160945 (https://bugzilla.wikimedia.org/54962) (owner: 10coren) [11:58:27] (03PS3) 10Alexandros Kosiaris: Remove the last resources of snmp on hosts [puppet] - 10https://gerrit.wikimedia.org/r/143306 [12:00:32] (03CR) 10Alexandros Kosiaris: [C: 032] Remove the last resources of snmp on hosts [puppet] - 10https://gerrit.wikimedia.org/r/143306 (owner: 10Alexandros Kosiaris) [12:00:48] (03CR) 10Alexandros Kosiaris: [C: 032] librenms requires snmp [puppet] - 10https://gerrit.wikimedia.org/r/160944 (owner: 10Alexandros Kosiaris) [12:20:17] !log upgrading jenkins 1.565.1 -> 1.565.2 [12:20:23] Logged the message, Master [12:20:55] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [12:22:27]