[00:01:53] (03PS10) 10Andrew Bogott: Update labs instances to use the new ldap-eqiad server [puppet] - 10https://gerrit.wikimedia.org/r/162689 [00:01:55] (03PS1) 10Ori.livneh: Use Debian-packaged texvc on Trusty app servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162795 (https://bugzilla.wikimedia.org/71224) [00:02:18] ^ AaronSchulz [00:06:21] (03CR) 10Andrew Bogott: Update labs instances to use the new ldap-eqiad server [puppet] - 10https://gerrit.wikimedia.org/r/162689 (owner: 10Andrew Bogott) [00:08:47] (03CR) 10Aaron Schulz: [C: 031] Use Debian-packaged texvc on Trusty app servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162795 (https://bugzilla.wikimedia.org/71224) (owner: 10Ori.livneh) [00:30:13] (03CR) 10Jdlrobson: [C: 031] Add wikidatawiki to wgAppleTouchIcon and add wikidata.png to bits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162638 (https://bugzilla.wikimedia.org/70996) (owner: 10Glaisher) [00:35:38] (03PS2) 10Dzahn: create shell account for nettrom [puppet] - 10https://gerrit.wikimedia.org/r/162192 [00:36:13] (03CR) 10Dzahn: [C: 032] create shell account for nettrom [puppet] - 10https://gerrit.wikimedia.org/r/162192 (owner: 10Dzahn) [00:41:21] (03PS2) 10Dzahn: add nettrom to various statistic/analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/162193 [00:44:45] (03CR) 10Dzahn: [C: 032] add nettrom to various statistic/analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/162193 (owner: 10Dzahn) [00:47:21] (03CR) 10Dzahn: "https://rt.wikimedia.org/Ticket/Display.html?id=8343" [puppet] - 10https://gerrit.wikimedia.org/r/162193 (owner: 10Dzahn) [00:55:29] !log icinga - manually deleted duplicate host labs-ns1 to fix icinga config and reloads [00:56:10] (because even if puppet re-adds it i can have my unrelated stuff work :p) [01:00:40] PROBLEM - Certificate expiration on neptunium is CRITICAL: SSL_CERT CRITICAL ldap-eqiad.wikimedia.org: invalid CN (ldap-eqiad.wikimedia.org does not match 
neptunium.wikimedia.org) [01:02:45] ACKNOWLEDGEMENT - mathoid on sca1001 is CRITICAL: Connection refused daniel_zahn RoanKattouw: [01:02:46] ACKNOWLEDGEMENT - mathoid on sca1002 is CRITICAL: Connection refused daniel_zahn RoanKattouw: [01:05:46] ACKNOWLEDGEMENT - Certificate expiration on labcontrol2001 is CRITICAL: SSL_CERT CRITICAL ldap-codfw.wikimedia.org: invalid CN (ldap-codfw.wikimedia.org does not match labcontrol2001.wikimedia.org) daniel_zahn foo bar baz [01:05:46] ACKNOWLEDGEMENT - Certificate expiration on neptunium is CRITICAL: SSL_CERT CRITICAL ldap-eqiad.wikimedia.org: invalid CN (ldap-eqiad.wikimedia.org does not match neptunium.wikimedia.org) daniel_zahn foo bar baz [01:07:40] ACKNOWLEDGEMENT - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 18 data above and 0 below the confidence bounds daniel_zahn since 5 days and check is largely ignored [01:07:40] ACKNOWLEDGEMENT - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 18 data above and 0 below the confidence bounds daniel_zahn since 5 days and check is largely ignored [01:09:00] ACKNOWLEDGEMENT - Host mathoid.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn duration 5 days [01:09:03] RoanKattouw_away: ^ [01:16:50] i wish it paged the ack when it paged an alert [01:18:18] (03CR) 1020after4: [C: 031] T458: Rename ext_ref description and hide it from users [puppet] - 10https://gerrit.wikimedia.org/r/162161 (owner: 10Chad) [01:22:37] (03PS3) 10Ori.livneh: redirect wikimania.org/.com to wikimania2015 [puppet] - 10https://gerrit.wikimedia.org/r/161405 (owner: 10Dzahn) [01:22:52] (03CR) 10Ori.livneh: [C: 032 V: 032] "Deploying this jointly with Dzahn" [puppet] - 10https://gerrit.wikimedia.org/r/161405 (owner: 10Dzahn) [01:31:09] (03PS1) 10Chmarkine: phabricator - enable HSTS with max-age 7 days [puppet] - 10https://gerrit.wikimedia.org/r/162805 [01:33:06] (03PS2) 10Chmarkine: phabricator - enable HSTS with 
max-age 7 days [puppet] - 10https://gerrit.wikimedia.org/r/162805 (https://bugzilla.wikimedia.org/38516) [01:33:31] (03PS1) 10Mattflaschen: Extend GettingStarted bucketting period to Sept. 28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162806 [01:34:48] (03CR) 10Mattflaschen: [C: 04-1] "We're going to ask when we can deploy this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162806 (owner: 10Mattflaschen) [01:35:05] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:35:47] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 1.257 second response time [01:39:10] !log gracefuling apaches [01:53:44] (03PS1) 10Ori.livneh: Apply the app server role on mw1022 [puppet] - 10https://gerrit.wikimedia.org/r/162810 [01:54:13] (03PS1) 10Catrope: Follouwp 6084646d: apply directory creation hack to labs too [puppet] - 10https://gerrit.wikimedia.org/r/162811 [01:54:34] (03PS2) 10Catrope: Follouwp 6084646d: apply Mathoid directory creation hack to labs too [puppet] - 10https://gerrit.wikimedia.org/r/162811 [01:54:47] (03CR) 10Dzahn: [C: 031] "yep, mw1022 did not pick up changes when we deployed apache config change, regex was off-by-one" [puppet] - 10https://gerrit.wikimedia.org/r/162810 (owner: 10Ori.livneh) [01:55:10] (03CR) 10Dzahn: [C: 032] "yep, mw1022 did not pick up changes when we deployed apache config change, regex was off-by-one" [puppet] - 10https://gerrit.wikimedia.org/r/162810 (owner: 10Ori.livneh) [01:55:16] (03CR) 10jenkins-bot: [V: 04-1] Follouwp 6084646d: apply Mathoid directory creation hack to labs too [puppet] - 10https://gerrit.wikimedia.org/r/162811 (owner: 10Catrope) [01:57:07] (03PS3) 10Catrope: Followup 6084646d: apply directory creation hack to labs too [puppet] - 10https://gerrit.wikimedia.org/r/162811 [01:57:35] (03PS4) 10Catrope: Followup 6084646d: apply Mathoid directory creation hack to labs too [puppet] - 
10https://gerrit.wikimedia.org/r/162811 [02:04:13] ori: now [02:04:15] - cluster: misc [02:04:15] + cluster: appserver [02:04:36] aha [02:04:47] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [02:05:05] and we got the Apache change too , yep [02:05:39] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3875 MB (3% inode=99%): [02:08:16] (03CR) 10Ori.livneh: [C: 032] Use Debian-packaged texvc on Trusty app servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162795 (https://bugzilla.wikimedia.org/71224) (owner: 10Ori.livneh) [02:08:22] (03Merged) 10jenkins-bot: Use Debian-packaged texvc on Trusty app servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162795 (https://bugzilla.wikimedia.org/71224) (owner: 10Ori.livneh) [02:08:57] !log ori Synchronized wmf-config/CommonSettings.php: Use Debian-packaged texvc on Trusty app servers (duration: 00m 04s) [02:19:07] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: Puppet has 1 failures [02:32:07] PROBLEM - puppet last run on mw1159 is CRITICAL: CRITICAL: Puppet has 1 failures [02:32:10] PROBLEM - puppet last run on mw1190 is CRITICAL: CRITICAL: Puppet has 1 failures [02:32:56] !log LocalisationUpdate completed (1.24wmf21) at 2014-09-25 02:32:56+00:00 [02:33:18] PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: Puppet has 1 failures [02:34:06] (03PS1) 10Dzahn: pdf servers - remove from dsh,dhcp,ganglia [puppet] - 10https://gerrit.wikimedia.org/r/162814 [02:36:08] RECOVERY - puppet last run on mw1159 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [02:37:27] RECOVERY - puppet last run on mw1047 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [02:49:17] RECOVERY - puppet last run on mw1190 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [03:00:17] RECOVERY - Disk space on 
virt0 is OK: DISK OK [03:02:47] !log LocalisationUpdate completed (1.24wmf22) at 2014-09-25 03:02:46+00:00 [03:10:38] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [03:58:02] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Sep 25 03:58:02 UTC 2014 (duration 58m 1s) [04:08:18] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [04:22:08] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:22:17] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [04:22:58] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.008 second response time [05:48:17] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [06:01:17] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [06:12:31] (03PS1) 10Springle: depool db1062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162832 [06:12:55] (03CR) 10Springle: [C: 032] depool db1062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162832 (owner: 10Springle) [06:13:00] (03Merged) 10jenkins-bot: depool db1062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162832 (owner: 10Springle) [06:13:19] !log springle Synchronized wmf-config/db-eqiad.php: depool db1062 (duration: 00m 07s) [06:28:08] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Epic puppet fail [06:28:17] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 2 failures [06:28:28] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: Epic puppet fail [06:28:37] PROBLEM - puppet last run on mw1175 is CRITICAL: CRITICAL: Epic puppet fail [06:30:07] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 3 failures [06:30:07] PROBLEM - 
puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:59] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:08] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:17] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:17] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:17] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:18] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:28] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:28] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:28] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:28] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:28] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:28] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:29] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:29] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:30] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:30] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:31] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:37] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:39] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:39] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:57] PROBLEM - puppet last run on cp4003 is CRITICAL: 
CRITICAL: Puppet has 1 failures [06:32:18] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:45:07] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:45:07] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:45:17] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:45:27] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:45:28] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:45:38] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:45:38] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:45:38] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [06:45:38] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:45:47] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:45:59] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:46:19] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:46:28] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:46:37] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:46:38] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is 
currently enabled, last run 24 seconds ago with 0 failures [06:46:38] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:46:38] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:46:38] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:46:38] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:46:47] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:46:48] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:46:48] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:46:48] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:47:10] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:47:10] RECOVERY - puppet last run on mw1175 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:47:17] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:47:18] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures [06:47:37] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:51:07] PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: Puppet has 2 failures [06:53:01] (03CR) 10Zfilipin: [C: 031] contint: labs slaves +mediawiki::packages::fonts [puppet] - 10https://gerrit.wikimedia.org/r/162604 (https://bugzilla.wikimedia.org/69535) 
(owner: Hashar)
[07:09:17] RECOVERY - puppet last run on db1039 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[07:17:49] YuviPanda: grr:
[07:17:51] if ($::realm == 'labs') {
[07:17:51]     # Mount extra disk on /srv so carbon has somewhere to store metrics
[07:17:53]     require role::labs::lvm::srv
[07:17:55] }
[07:18:02] what is that doing in role::graphite::base?
[07:30:07] (PS2) Ori.livneh: Graphite: enable CORS for Grafana [puppet] - https://gerrit.wikimedia.org/r/162559
[07:30:31] godog: yt?
[07:30:50] (CR) jenkins-bot: [V: -1] Graphite: enable CORS for Grafana [puppet] - https://gerrit.wikimedia.org/r/162559 (owner: Ori.livneh)
[07:31:12] PEP8? really? gah.
[07:38:38] (PS3) Ori.livneh: Graphite: enable CORS for Grafana [puppet] - https://gerrit.wikimedia.org/r/162559
[07:42:13] (CR) Ori.livneh: "PS2-3: manage CORS in the webapp via a Django middleware" [puppet] - https://gerrit.wikimedia.org/r/162559 (owner: Ori.livneh)
[07:43:25] * ori is pretty pleased with how that patch turned out.
[07:43:39] <_joe_> django to just manage cors?
[07:43:47] graphite-web is a django webapp
[07:44:02] so not introducing django to manage cors, just managing cors by adding a small django middleware
[07:44:03] <_joe_> oh ok just a middleware for the main app
[07:44:05] yep
[07:44:08] <_joe_> eheh yes
[07:45:34] <_joe_> it's funny to have a web framework named after a spaghetti western
[07:48:55] heh
[07:49:24] speaking of funning, uwsgi is increasingly becoming something like http://www.templeos.org/
[07:49:52] it's kind of amazing, i think possibly the author of uwsgi is a genius and this is his idea of satire
[07:50:10] <_joe_> uwsgi used to be very very good
[07:50:25] <_joe_> I loved its architecture, but I'm not updated to the last evolutions
[07:50:33] <_joe_> (what is templeos btw?)
[07:50:41] templeos is amazing [07:51:57] it's an x86_64 os written by a brilliant systems engineer who is also a paranoid schizophrenic [07:52:06] who believes that god is talking to him [07:52:25] <_joe_> oh so it's serious? [07:52:35] <_joe_> ori: what you don't like about uwsgi? [07:52:37] yes, he's actually a nice guy and amazingly creative [07:52:59] _joe_: the sprawl! [07:54:48] <_joe_> ori: my wife would joke that most brilliant systems engineers are [07:55:20] <_joe_> (both paranodi schizophrenics, and nice guys [07:55:23] http://uwsgi-docs.readthedocs.org/en/latest/Broodlord.html#auto-scaling-with-broodlord-mode [07:55:29] "The [zerg] stanza is the config the Emperor will run when a vassal requires resources. " what? [07:55:38] <_joe_> ahahahah [07:55:41] <_joe_> LOL [07:57:05] ah Zerg [07:57:10] I am missing Starcraft now [07:57:14] and the Korean dudes [07:57:21] kkkkkkk [07:57:29] gg [07:57:32] <_joe_> never even looked at Starcraft [07:57:37] come on [07:57:55] b3st g4m3 3v3r! [07:57:56] _joe_: i'm not a gamer at all but starcraft is required for literacy [07:58:03] like dune 2 [07:58:06] +1 [07:58:15] <_joe_> mmmh I thought Balzac was [07:58:16] ori: there is a multiplayer dune 2 version floating around iirc [07:58:25] <_joe_> Or Tolstoj [07:58:30] also, how do you expect to manage uwsgi deployments without starcraft knowledge? [07:58:38] _joe_: that is for legal department, not engineering [07:58:56] <_joe_> ori: right; for useless pop-culture references, I have wikipedia! [08:00:02] " Starting from uWSGI 1.3-dev, a customizable secondary :term:`harakiri` subsystem has been added." [08:00:34] <_joe_> ori: they're just a bunch of devs having fun [08:00:49] <_joe_> and not entrapped by corporatesque seriosity [08:01:24] * springle reads that phrase twice [08:02:08] gl hf dd [08:02:19] <_joe_> springle: mmmh that sounds wrong? 
[08:02:26] rfarrand: :D [08:03:06] gg BBQed [08:04:29] _joe_: here's a taste: http://www.templeos.org/Wb/Accts/TS/Wb2/WalkThru.html [08:05:45] oh my the poor beta cluster has too many puppet patches :/ [08:06:19] "I was about to do different graphic modes when I found 800x600 missing. God said just one mode 640x480. I was about to add child windows. God said, "God is not the author of such confusion." I asked for verification of 640x480 16 color. God said it was because of the children and their offerings. I asked about sound. God said "single voice". I asked for verification of not having different drivers. God confirmed this." [08:06:44] the great thing about this quote is that it could have just as easily come from either the uwsgi docs or the templeos docs [08:07:02] (it's templeos: http://www.templeos.org/Wb/Adam/God/HSNotes.html#l1) [08:07:57] oh, sorry, last link: https://www.youtube.com/watch?v=1okW1RTPZ7Q [08:09:12] (03CR) 10Hashar: "Patch broke beta cluster which is fixed with https://gerrit.wikimedia.org/r/#/c/148371/ "Beta: fill missing $lvs_service_ips['ocg']"" [puppet] - 10https://gerrit.wikimedia.org/r/146860 (owner: 10BBlack) [08:09:25] ori: beta has a live hack that reverts LVS config for OCG https://gerrit.wikimedia.org/r/#/c/146860/ , turns out it got fixed later on [08:09:28] ori: so I am removing it :] [08:09:51] nice [08:10:10] RoanKattouw_away sent me an email noticing puppet is lagged out on beta [08:16:18] _joe_: I might have messed up some hiera related patch on beta cluster :/ [08:16:39] _joe_: namely https://gerrit.wikimedia.org/r/#/c/151869/ "" puppet: hiera backend for the WMF "" [08:19:14] <_joe_> hashar: uh? [08:19:35] <_joe_> hashar: can you elaborate on that? [08:19:36] _joe_: the beta puppetmaster apparently had an obsolete version of that change [08:19:42] <_joe_> yes. [08:19:45] <_joe_> remove it! 
[08:19:53] so on rebasing it, it just had a small change to the prod.yaml file
[08:19:57] <_joe_> I can do it for you if you want
[08:20:01] did :]
[08:20:06] <_joe_> ok, sorry
[08:20:13] beta puppetmaster is all fine
[08:21:01] _joe_: we will have to talk together about hiera and beta next week. We would like to setup a second cluster and I guess that needs playing with hiera :-]
[08:21:48] <_joe_> hashar: hiera would help reducing beta discrepancies with prod as well I guess
[08:22:11] definitely, bye-bye if $::realm == 'labs' and the evil configuration hashes
[08:22:29] <_joe_> well, the config hashes will still be there
[08:22:30] <_joe_> :)
[08:23:07] well the one for varnish will have to be adjusted cause we would need to vary by $::realm and $::labsprojectname (or something like that)
[08:23:31] since the app servers will have different IP in the two labs project
[08:23:33] <_joe_> and we can do that
[08:23:45] <_joe_> it's already supported in hiera
[08:23:48] <_joe_> ;)
[08:24:17] <_joe_> ori: I'm including upstream fixes to HHVM-3.3.0
[08:24:56] _joe_: do i have time to dash off another quick patch?
[08:25:03] i need 15 mins or so
[08:25:05] <_joe_> of course
[08:25:09] <_joe_> take your time
[08:25:13] thanks
[08:25:19] <_joe_> rebasing on upstream is usually painful
[08:25:21] _joe_: the pity is that any change will impact prod as well -:D
[08:25:38] <_joe_> hashar: cherry-pick is there for you
[08:28:07] _joe_: humm new package of hhvm coming?
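(Editor's sketch for the hiera exchange at 08:21-08:23: a minimal Python model of a first-match hierarchy lookup, to show how one key could vary by realm and labs project instead of living in an "if $::realm == 'labs'" branch. Everything here is an assumption for illustration: the hierarchy levels, the key name app_server_ips, the project name, and the addresses (taken from documentation ranges) are hypothetical and not the WMF configuration. Skipping a level whose fact is missing is also a simplification of what Hiera does.)

```python
import re

# Hypothetical hierarchy, most specific level first.
HIERARCHY = ["%{realm}/%{labsprojectname}", "%{realm}", "common"]

# Hypothetical data sources; addresses come from documentation ranges.
DATA = {
    "labs/deployment-prep": {"app_server_ips": ["192.0.2.10"]},
    "labs": {"app_server_ips": ["192.0.2.1"]},
    "common": {"app_server_ips": ["198.51.100.1"]},
}


def interpolate(level, facts):
    """Fill %{fact} placeholders; return None when a needed fact is missing."""
    names = re.findall(r"%\{(\w+)\}", level)
    if any(name not in facts for name in names):
        return None
    return re.sub(r"%\{(\w+)\}", lambda m: facts[m.group(1)], level)


def lookup(key, facts, hierarchy=HIERARCHY, data=DATA):
    """Return the value from the first hierarchy level that defines the key."""
    for level in hierarchy:
        source = interpolate(level, facts)
        if source is None:
            continue  # this level does not apply to the node
        values = data.get(source, {})
        if key in values:
            return values[key]
    raise KeyError(key)


if __name__ == "__main__":
    beta = {"realm": "labs", "labsprojectname": "deployment-prep"}
    print(lookup("app_server_ips", beta))
    print(lookup("app_server_ips", {"realm": "production"}))
```

(With this shape, a second labs project only needs its own data source; the manifests themselves stay identical between labs and production.)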
[08:28:29] <_joe_> Nikerabbit: we have one from yesterday
[08:28:40] <_joe_> there will be another one by the end of the european morning
[08:28:44] <_joe_> hopefully
[08:29:08] <_joe_> or "just this patch and I'll sleep" time in ori's timezone
[08:29:24] haha it's soon noon here
[08:29:26] we'll see
[08:31:51] (CR) Hashar: [C: -1] "Apache doc for %O at https://httpd.apache.org/docs/2.2/mod/mod_log_config.html states you need to enable apache module mod_logio and I hav" [puppet] - https://gerrit.wikimedia.org/r/162541 (owner: Jeremyb)
[08:32:28] could someone with sysop rights on wikitech please revert all actions of https://wikitech.wikimedia.org/wiki/Special:Contributions/Skins1 and block the vandal?
[08:32:42] _joe_: should we get hhvm ensure => latest on labs realm ?
[08:33:01] <_joe_> hashar: not really
[08:36:01] _joe_: I am afraid I will forget to update hhvm package on the integration labs project (which is used by the hhvm related Jenkins jobs)
[08:36:52] <_joe_> hashar: ok then... be my guest
[08:37:00] <_joe_> sorry I'm in the middle of patching hhvm
[08:41:09] (PS1) Ori.livneh: HHVM: update JIT settings [puppet] - https://gerrit.wikimedia.org/r/162839
[08:42:19] (CR) JanZerebecki: [C: 1] phabricator - enable HSTS with max-age 7 days [puppet] - https://gerrit.wikimedia.org/r/162805 (https://bugzilla.wikimedia.org/38516) (owner: Chmarkine)
[08:42:53] <_joe_> ori: oh srsly?
[08:43:05] ori: and once you have slept enough, CI might need a custom hhvm config file as well :]
[08:43:24] srsly what?
[08:43:27] <_joe_> hashar: I can help with that
[08:43:38] <_joe_> ori: did they change that in 3.3.0
[08:43:49] yes. annoying, i know.
[08:43:57] https://github.com/facebook/hhvm/commit/ca99ef1
[08:43:57] <_joe_> GRRR
[08:44:06] <_joe_> yeah I was reading right now
[08:44:52] (PS1) Giuseppe Lavagetto: Imported Upstream version 3.3.0+20140925 [debs/hhvm] - https://gerrit.wikimedia.org/r/162840
[08:45:10] still one more patch!
[08:46:01] <_joe_> ori: any patch of yours is going into debian/patches anyways, so... not an issue
[08:56:01] nod
[08:57:44] (PS1) Giuseppe Lavagetto: Backport PR #3840 [debs/hhvm] - https://gerrit.wikimedia.org/r/162841
[09:01:05] ori: sure, still up? :)
[09:01:23] godog: obvs.
[09:01:36] yeah I don't even know why I asked
[09:02:15] godog: b.black confirmed that apache can do the header manipulation; i got it to work sans graphite and then i realized the uwsgi hijacks request processing and therefore we can't use apache
[09:02:33] but i came up with a pretty neat solution i think: https://gerrit.wikimedia.org/r/#/c/162559/
[09:03:03] i tested it in vagrant, was going to offer to hack it locally on tungsten for you so you can confirm that it works before merging
[09:03:08] but i gotta wrap up some hhvm stuff first
[09:03:47] but: think you might have time for that later?
[09:04:00] (i am naggy but i take rejection well ;))
[09:04:14] ori: sure I'm taking a look now, I have a "code reviews first thing in the morning" routine anyways :)
[09:05:56] <_joe_> godog: that's sane, I usually postpone them for the late afternoon and I always fall short
[09:06:32] _joe_: i have the patch. not sure how to generate a patchfile of the exact format
[09:06:58] <_joe_> ori: diff output is good
[09:07:01] <_joe_> git diff as well
[09:07:26] <_joe_> just add the resulting .patch file to debian/patches and submit it, I'll format it
[09:07:33] _joe_: yeah I know I'll forget otherwise
[09:08:00] _joe_: https://dpaste.de/WFnG/raw -- applies cleanly against your latest patchset
[09:08:12] _joe_: would you like me to write a short summary?
[09:08:59] 'fix-memcached-increment-decrement.patch' is probably descriptive
[09:09:41] <_joe_> yep
[09:09:48] <_joe_> ori: a summary is fine
[09:09:53] <_joe_> or I can write it no problem
[09:10:21] i documented the issue here: https://github.com/facebook/hhvm/issues/3839
[09:10:49] <_joe_> you also described it to me
[09:11:10] This works around the problem by ignoring the initial_value and offset arguments and delegating to the appropriate libmemcached functions
[09:11:22] which is actually what we want
[09:11:37] the only downside is that if we were to switch to the binary protocol tomorrow and start using offset/initial_value it wouldn't work
[09:11:54] <_joe_> ori: eh, for now just this bandaid is fine
[09:12:02] * ori nods
[09:14:26] (CR) Filippo Giunchedi: [C: 1] Graphite: enable CORS for Grafana [puppet] - https://gerrit.wikimedia.org/r/162559 (owner: Ori.livneh)
[09:14:45] ori: cors patch looks good, since it doesn't touch scary bits anymore I can just merge it
[09:14:58] WFM :)
[09:16:01] grafana + hhvm packages, it's like christmas
[09:17:14] (CR) Filippo Giunchedi: [C: 2 V: 2] Graphite: enable CORS for Grafana [puppet] - https://gerrit.wikimedia.org/r/162559 (owner: Ori.livneh)
[09:17:55] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
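(Editor's sketch for the CORS change merged above: a minimal response middleware in the old Django style, a plain class with a process_response hook, which fits ori's description of handling CORS inside graphite-web rather than in Apache. This is not the code from change 162559; the class name, the default pattern list and the chosen headers are assumptions. It imports nothing from Django so that it can run standalone: the request only needs a META dict and the response only needs item assignment.)

```python
import re

# Assumption: allowed origins are given as regex patterns, in the spirit of the
# CORS_ORIGINS setting that appears in the puppet output later in this log.
CORS_ORIGINS = [r"https?://grafana\.wikimedia\.org"]


def cors_headers(origin, patterns=CORS_ORIGINS):
    """Return the CORS headers to add for this Origin, or {} if not allowed."""
    if not origin:
        return {}
    if any(re.fullmatch(pattern, origin) for pattern in patterns):
        return {
            # Echo the specific origin back instead of using a wildcard.
            "Access-Control-Allow-Origin": origin,
            "Access-Control-Allow-Methods": "GET, OPTIONS",
        }
    return {}


class CorsMiddleware(object):
    """Old-style middleware: the framework calls process_response per request."""

    def __init__(self, patterns=CORS_ORIGINS):
        self.patterns = list(patterns)

    def process_response(self, request, response):
        origin = request.META.get("HTTP_ORIGIN")
        for name, value in cors_headers(origin, self.patterns).items():
            response[name] = value
        return response


if __name__ == "__main__":
    print(cors_headers("http://grafana.wikimedia.org"))
    print(cors_headers("http://evil.example.org"))
```

(Matching with re.fullmatch rather than re.match keeps an origin such as http://grafana.wikimedia.org.evil.example from passing. A production version would likely also set Vary: Origin and answer preflight requests.)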
[09:18:09] icinga-wm: shush
[09:20:39] ori: Notice: /Stage[main]/Graphite::Web/Exec[create_graphite_admin]/returns: CORS_ORIGINS = https?://grafana.wikimedia.org
[09:20:48] missing quoting
[09:22:08] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 111 bytes in 0.019 second response time
[09:22:17] fix incoming, sec
[09:22:37] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures
[09:22:54] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Imported Upstream version 3.3.0+20140925 [debs/hhvm] - https://gerrit.wikimedia.org/r/162840 (owner: Giuseppe Lavagetto)
[09:22:58] !log graphite temporarily down, fix incoming
[09:23:24] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Backport PR #3840 [debs/hhvm] - https://gerrit.wikimedia.org/r/162841 (owner: Giuseppe Lavagetto)
[09:24:23] godog: doh, http://stackoverflow.com/questions/3960392/ruby-1-9-array-to-s-behaves-differently
[09:25:26] (PS1) Ori.livneh: Fix-up for Ieebc69411 [puppet] - https://gerrit.wikimedia.org/r/162845
[09:25:30] ^ godog
[09:25:53] my vagrant vm uses ruby 1.9, hence the mistake :/
[09:26:14] (CR) Filippo Giunchedi: [C: 2 V: 2] Fix-up for Ieebc69411 [puppet] - https://gerrit.wikimedia.org/r/162845 (owner: Ori.livneh)
[09:27:18] ori: ack! I'll add that to the list "ruby is perl in the 21st century" list
[09:27:54] did minor perl versions introduce such traps? i never stuck with perl long enough
[09:29:25] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.065 second response time
[09:29:27] possibly not, but equally subtle behaviours
[09:29:28] godog: http://grafana.wikimedia.org/ \o/
[09:29:37] wohoo
[09:29:45] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[09:29:45] thanks a ton!
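(Editor's sketch of the quoting bug above. In Ruby 1.8, Array#to_s concatenates the elements, while in 1.9 it returns the inspect form with brackets and quotes, so a template that prints the array directly would emit different text on the two versions. The puppet output at 09:20:39 shows the unquoted form, which is not valid Python for a settings file, and that would be consistent with the HTTP 500 that followed. The Python model below shows both renderings and the safer habit of always emitting an explicit literal. The function names are hypothetical, and this is not the content of the fix-up change 162845.)

```python
import ast


def render_naive(name, values):
    """Model of the broken output: elements concatenated, quotes lost."""
    return "%s = %s" % (name, "".join(values))


def render_literal(name, values):
    """Render the value as an explicit Python literal, whatever the input."""
    return "%s = %r" % (name, list(values))


def is_valid_python(source):
    """True if the rendered line would parse as a Python settings file."""
    try:
        compile(source, "<settings>", "exec")
    except SyntaxError:
        return False
    return True


if __name__ == "__main__":
    origins = ["https?://grafana.wikimedia.org"]
    broken = render_naive("CORS_ORIGINS", origins)
    fixed = render_literal("CORS_ORIGINS", origins)
    print(broken, is_valid_python(broken))
    print(fixed, is_valid_python(fixed))
    # The right-hand side round-trips back to the original list.
    assert ast.literal_eval(fixed.split(" = ", 1)[1]) == origins
```

(The general point is that a generated config file should not depend on a language's implicit string conversion; serializing the value explicitly gives the same file on every interpreter version.)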
[09:29:50] <_joe_> :))
[09:30:01] ori: yw, thanks to you as you did most of the work anyway :)
[09:30:49] <_joe_> ori: if you're too exhausted just point me to your hhvm patch, I'll handle it
[09:31:03] _joe_: i did earlier
[09:31:13] 02:08 _joe_: https://dpaste.de/WFnG/raw -- applies cleanly against your latest patchset
[09:31:32] <_joe_> oh I missed it in between all the noise
[09:32:58] _joe_: "Work around https://github.com/facebook/hhvm/issues/3839 by silently discarding unsupported (and for Wikimedia, unused) arguments to Memcached::increment & Memcached::decrement" for a summary
[09:33:24] <_joe_> ori: LGTM
[09:46:04] (PS1) Filippo Giunchedi: install-server: install lldpd early [puppet] - https://gerrit.wikimedia.org/r/162847
[09:46:48] (PS1) Giuseppe Lavagetto: fix increment warnings in memcached [debs/hhvm] - https://gerrit.wikimedia.org/r/162848
[09:47:12] mark: https://gerrit.wikimedia.org/r/#/c/162847/ the patch to install lldpd early
[09:51:44] (CR) Giuseppe Lavagetto: [C: 2 V: 2] fix increment warnings in memcached [debs/hhvm] - https://gerrit.wikimedia.org/r/162848 (owner: Giuseppe Lavagetto)
[09:53:06] Reedy: can you expand on https://gerrit.wikimedia.org/r/#/c/135544/2 ?
[09:58:56] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Puppet has 1 failures
[09:59:05] PROBLEM - puppet last run on amssq52 is CRITICAL: CRITICAL: Epic puppet fail
[09:59:16] (CR) Mark Bergsma: [C: 1] install-server: install lldpd early [puppet] - https://gerrit.wikimedia.org/r/162847 (owner: Filippo Giunchedi)
[10:00:59] godog: and from this, we could make a puppet fact that has the row/rack name
[10:01:12] at least eqiad/codfw have a very standard scheme
[10:01:21] esams/ulsfo are a bit inconsistent though
[10:03:35] e.g. if lldp reports that the server is connected to asw-c-eqiad:5/0/23, you can deduce that the server is in rack C5
[10:06:02] mark: indeed, _that_ we can compare automatically to racktables' idea of the world
[10:06:08] (PS1) Giuseppe Lavagetto: Version bump [debs/hhvm] - https://gerrit.wikimedia.org/r/162851
[10:06:47] and even mention in e.g. the MOTD
[10:06:58] "This server resides in rack C5 in eqiad"
[10:07:23] there are a few special cases, e.g. LVS, but for 95% of servers that should be fine
[10:07:47] cool!
[10:08:07] in tampa this didn't work at all which is part of the reason why we haven't done it yet
[10:08:10] but that's no longer relevant now
[10:08:14] just esams/ulsfo to worry about a bit
[10:08:17] esams especially
[10:08:25] esams switch ordering isn't even consistent with rack ordering
[10:08:26] but ulsfo is easy
[10:09:33] mark: nice, I've captured that in RT #8439
[10:09:38] cool
[10:13:51] (CR) Filippo Giunchedi: "cron[updatetranslationstats] changed status in the puppet compiler output" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/160232 (owner: Ori.livneh)
[10:15:19] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Version bump [debs/hhvm] - https://gerrit.wikimedia.org/r/162851 (owner: Giuseppe Lavagetto)
[10:17:15] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[10:18:31] RECOVERY - puppet last run on amssq52 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[10:23:06] can someone move https://wikitech.wikimedia.org/wiki/User:Server_Admin_Log back to https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:22] * aude does not have appropriate rights
[10:23:57] <_joe_> aude: there is some vandal in action?
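(Editor's sketch of mark's rack-deduction idea from 10:03-10:07: parse the LLDP neighbor string into a site, row and rack, then format the MOTD line he quotes. It assumes the eqiad/codfw convention described in the chat, where the switch is named asw-<row>-<site> and the first number of the port is the rack within that row. The regular expression, the function names and the tolerance for an interface prefix such as ge- are hypothetical. Sites with irregular schemes such as esams, and special cases such as LVS, would need their own handling.)

```python
import re

# Assumed naming scheme: asw-<row letter>-<site>:<rack>/<slot>/<port>,
# optionally with an interface-type prefix such as "ge-" before the numbers.
_NEIGHBOR_RE = re.compile(
    r"^asw\d*-(?P<row>[a-z])\d*-(?P<site>[a-z0-9]+):"
    r"(?:[a-z]+-)?(?P<rack>\d+)/\d+/\d+$"
)


def rack_from_lldp(neighbor):
    """Return {'site', 'row', 'rack'} for a neighbor string, or None."""
    match = _NEIGHBOR_RE.match(neighbor.strip().lower())
    if match is None:
        return None  # unknown scheme: do not guess
    row = match.group("row").upper()
    return {
        "site": match.group("site"),
        "row": row,
        "rack": "%s%d" % (row, int(match.group("rack"))),
    }


def motd_line(neighbor):
    """Format the MOTD sentence from the chat, or None if the rack is unknown."""
    location = rack_from_lldp(neighbor)
    if location is None:
        return None
    return "This server resides in rack %(rack)s in %(site)s" % location


if __name__ == "__main__":
    print(rack_from_lldp("asw-c-eqiad:5/0/23"))
    print(motd_line("asw-c-eqiad:5/0/23"))
```

(Returning None for anything that does not match leaves the fact unset rather than wrong, which matters if the value is later compared against racktables as godog suggests.)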
[10:24:05] looks like Nemo_bis took care of it [10:24:10] yes there was a vandal [10:31:46] !log SAL is here [10:31:54] Logged the message, Master [10:40:52] (03CR) 10Filippo Giunchedi: [C: 031] "I take it this has been applied already at runtime?" [puppet] - 10https://gerrit.wikimedia.org/r/162661 (owner: 10Manybubbles) [10:41:38] (03PS1) 10Yurik: Enable ZeroPortal lua extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162855 [10:42:25] (03PS1) 10Hashar: zuul: client to easily query Gearman server [puppet] - 10https://gerrit.wikimedia.org/r/162856 [10:49:59] (03CR) 10Glaisher: "Yeah; it needs a background. Other files at apple-touch also does. sjoerddebruin says it should also be square-shaped but I don't think it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162638 (https://bugzilla.wikimedia.org/70996) (owner: 10Glaisher) [10:57:02] !log upgraded bash on labsdb1003 [10:57:07] Logged the message, Master [10:59:19] <_joe_> godog: there are a couple more systems left behind [11:00:03] _joe_: I couldn't find any more via salt, which ones? [11:00:12] <_joe_> I updated them earlier [11:00:22] <_joe_> I dunno why I forgot about labsdb [11:00:28] <_joe_> and I didn't log it [11:00:39] <_joe_> obviously now I can't remember which ones [11:00:46] scrollback? [11:00:59] <_joe_> tried that :/ [11:01:13] <_joe_> anyways, one was a db.. 
[11:02:27] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] contint: labs slaves +mediawiki::packages::fonts [puppet] - 10https://gerrit.wikimedia.org/r/162604 (https://bugzilla.wikimedia.org/69535) (owner: 10Hashar) [11:03:35] <_joe_> godog: and the other one being elastic1007 [11:03:44] <_joe_> !log updated bash on elastic1007 [11:03:50] Logged the message, Master [11:06:58] cool [11:10:37] (03PS1) 10ArielGlenn: fix up ordering for salt-minion package, config, service [puppet] - 10https://gerrit.wikimedia.org/r/162860 [11:20:49] <_joe_> ori: since you're still awake - new hhvm packages available [11:21:18] ori is always awake [11:23:36] _joe_: awesome. shall i upgrade labs? [11:24:57] <_joe_> ori: I'll do that [11:25:12] <_joe_> and if hell does not freeze, I'd update prod as well today [11:25:24] _joe_: <3. thanks. i'll sleep then :P [11:26:06] <_joe_> Nikerabbit: new hhvm packages delivered almost on-time :) [11:26:26] it's the middle of afternoon in Helsinki :P [11:26:42] <_joe_> Nemo_bis: but I still didn't have lunch [11:26:51] <_joe_> so technically it's still morning for me [11:27:14] i am seeing a bunch of "Base lambda function for closure not found" errors [11:27:27] suppose it's ok if i try syncing WikibaseLib.default.php again? 
[11:27:32] it's apc issue [11:27:57] <_joe_> aude: if it's an apc issue just touching the file would be enough supposedly, yes [11:28:00] ok [11:28:28] only on wmf22 [11:29:37] !log aude Synchronized php-1.24wmf22/extensions/Wikidata/extensions/Wikibase/lib/config/WikibaseLib.default.php: fix apc issues (duration: 00m 06s) [11:29:41] Logged the message, Master [11:29:55] otherwise a graceful for apache would do [11:30:07] if this doesn't work [11:30:35] _joe_: don't forget https://gerrit.wikimedia.org/r/#/c/162839/ tho; otherwise the jit settings will break [11:30:58] <_joe_> yeah [11:31:00] reallr [11:31:02] it's on multiple servers mw1139, mw1199 [11:31:05] mainly [11:31:15] <_joe_> cherry-picking it on beta for now [11:31:27] <_joe_> ori: it sucks we need to coordinate this [11:31:32] can someone apache graceful on those? [11:31:40] or do i have rights for that? [11:31:49] <_joe_> mmmh I can put an ensure => version in the CR [11:31:57] <_joe_> aude: you don't I guess [11:32:01] <_joe_> which servers [11:32:04] <_joe_> ? 
[11:32:08] mw1139 and mw1199 [11:33:24] <_joe_> !log gracefully reloaded apache on mw1139 and mw1199, apc issues [11:33:29] Logged the message, Master [11:33:31] https://bugs.php.net/bug.php?id=52144 is the issue and think not be issue with hhvm [11:33:34] thanks [11:33:49] seems to have stopped [11:34:30] _joe_: hah, right ;) I'm getting hungry too [11:57:24] (03CR) 10Aude: "i don't think the logo itself should be square shaped but the background should be, afaik" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162638 (https://bugzilla.wikimedia.org/70996) (owner: 10Glaisher) [12:08:56] (03PS1) 10Yuvipanda: icinga: Move icinga web into module [puppet] - 10https://gerrit.wikimedia.org/r/162865 [12:14:51] (03PS2) 10Yuvipanda: icinga: Move icinga web into module [puppet] - 10https://gerrit.wikimedia.org/r/162865 [12:14:53] (03PS1) 10Yuvipanda: icinga: Move logrotate into module [puppet] - 10https://gerrit.wikimedia.org/r/162866 [12:16:58] (03PS1) 10Yuvipanda: icinga: Move user / group setup into module [puppet] - 10https://gerrit.wikimedia.org/r/162867 [12:22:08] PROBLEM - puppet last run on mw1192 is CRITICAL: CRITICAL: Epic puppet fail [12:22:25] wait, icinga has ncsa and nsca? [12:22:28] or is that just a typo? [12:22:30] * YuviPanda investigates [12:22:58] yup [12:22:59] typo [12:23:56] ouch :( [12:24:28] I think I spotted at least a check or hook to catch pmtpa vs ptmpa in puppet [12:32:11] (03PS1) 10Yuvipanda: icinga: Move NSCA code into module [puppet] - 10https://gerrit.wikimedia.org/r/162870 [12:33:21] (03PS2) 10Yuvipanda: icinga: Move NSCA code into module [puppet] - 10https://gerrit.wikimedia.org/r/162870 [12:33:22] godog: heh :) there's the 'typos' file, but it doesn't really check for anything [12:33:28] godog: also icinga.pp is now < 500 lines, yay! 
:) [12:34:08] YuviPanda: \o/ happy days, btw no I don't remember where I came across that check [12:34:35] godog: heh :) [12:36:23] !log update bash on elastic1014 analytics1021 elastic1013 [12:36:28] Logged the message, Master [12:36:46] _joe_: the reason I missed some before is what I wrote in the salt thread btw [12:37:05] <_joe_> godog: of course [12:37:17] <_joe_> godog: I guess there should be a way to fix that [12:37:33] <_joe_> also, upgrading to newer salt versions could help [12:38:02] * YuviPanda wonders if the salt thread had concluded [12:39:37] <_joe_> YuviPanda: when it extinguishes, I'll write 'A critique of puppet' [12:39:42] _joe_: hehe :) [12:40:09] <_joe_> which, if I have to include everything, is going to take me about half of Q2 [12:40:44] * YuviPanda wonders if anyone is up for merging patches [12:41:09] PROBLEM - puppet last run on lvs1006 is CRITICAL: CRITICAL: Puppet has 1 failures [12:42:09] RECOVERY - puppet last run on mw1192 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [12:42:32] (03PS2) 10Yuvipanda: dsh: Move dsh related code into a module [puppet] - 10https://gerrit.wikimedia.org/r/162570 [12:43:36] (03CR) 10Filippo Giunchedi: zuul: client to easily query Gearman server (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/162856 (owner: 10Hashar) [12:48:04] (03PS1) 10Yuvipanda: icinga: Remove analytics.cfg according to TEMP: message [puppet] - 10https://gerrit.wikimedia.org/r/162872 [12:49:15] (03PS1) 10Christopher Johnson (WMDE): These changes add the "extension" Sprint. The implementation is actually as a libphutil library. It can be enabled with the setting "load-libraries" in the role manifest. It does not need to be symlinked into the ../phabricator/src/extensions directory. [puppet] - 10https://gerrit.wikimedia.org/r/162873 [12:50:17] (03PS2) 10Christopher Johnson (WMDE): These changes add the "extension" Sprint. The implementation is actually as a libphutil library. 
It can be enabled with the setting "load-libraries" in the role manifest. It does not need to be symlinked into the ../phabricator/src/extensions directory. [puppet] - 10https://gerrit.wikimedia.org/r/162873 [12:52:34] (03PS3) 10Yuvipanda: dsh: Move dsh related code into a module [puppet] - 10https://gerrit.wikimedia.org/r/162570 [12:58:19] RECOVERY - puppet last run on lvs1006 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [13:00:04] K4: Dear anthropoid, the time has come. Please deploy Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140925T1300). [13:07:38] aha, and I think I've found an easier/more elegant solution to making hosts.cfg than writing an openstackmanager API!!!1 [13:08:26] * YuviPanda waves at ottomata [13:08:32] icinga.pp is less than 500 lines now! [13:08:37] I have about 40 patches open tho [13:08:45] think you'll have time to merge any? most are super trivial [13:08:45] (03PS2) 10Manybubbles: Less exciting Elasticsearch configuration [puppet] - 10https://gerrit.wikimedia.org/r/162661 [13:08:55] (03CR) 10Manybubbles: "Yes, its all been applied." [puppet] - 10https://gerrit.wikimedia.org/r/162661 (owner: 10Manybubbles) [13:09:31] YuviPanda: hiya! I will do my best! I took a sick day yesterday so I have a bit of work to catch up on, and today is analytics team quarterly review [13:09:36] ah [13:09:38] alright :) [13:10:41] _joe_: what should go in /etc/puppet/private hiera-wise btw? 
(trying to get the swift credentials going) [13:12:42] godog: I"m going to merge manybubbles' elasticsearch change [13:12:45] <_joe_> godog: checking [13:12:49] (03PS3) 10Ottomata: Less exciting Elasticsearch configuration [puppet] - 10https://gerrit.wikimedia.org/r/162661 (owner: 10Manybubbles) [13:13:00] just cause i'm poking around in my gerrit queue :) [13:13:07] (03CR) 10Ottomata: [C: 032 V: 032] Less exciting Elasticsearch configuration [puppet] - 10https://gerrit.wikimedia.org/r/162661 (owner: 10Manybubbles) [13:13:14] ottomata: +1 I'm not the only one flushing the gerrit queue in the morning then [13:13:18] :) [13:13:50] <_joe_> godog: within a "hiera" directory, you should have codfw.yaml [13:14:11] (03PS2) 10Ottomata: Escape awk variable in kafkatee output [puppet] - 10https://gerrit.wikimedia.org/r/162282 [13:14:24] (03CR) 10Ottomata: [C: 032 V: 032] Escape awk variable in kafkatee output [puppet] - 10https://gerrit.wikimedia.org/r/162282 (owner: 10Ottomata) [13:14:53] YuviPanda: add me as reviewer on all your changes you want me to look at [13:16:21] ottomata: added you to two (independent patches, unrelated to icinga), the others are a long series of dependent patches for nagios_common [13:16:52] https://gerrit.wikimedia.org/r/#/q/owner:%22Yuvipanda+%253Cyuvipanda%2540gmail.com%253E%22+project:operations/puppet+status:open,n,z [13:16:56] and it spans two pages now :) [13:17:04] hooooboy [13:17:49] ottomata: I've added you to the first four changes [13:17:53] in that series [13:18:17] ok [13:18:18] cool [13:18:25] keepin it tractable :p [13:18:30] ottomata: yeah :) [13:18:49] ottomata: the other two patches are independent. 
one is removing the vumi code, and other is moving dsh into a module [13:20:47] ja i see the vumi one, looks easy to merge, likely no problems, but i'd want to babysit that one (run puppet manually) to make sure it is all cool [13:20:58] hence the delay until i'm ready to do so :) [13:21:02] ottomata: cool :) [13:27:29] (03PS1) 10Yuvipanda: icinga: Move wikidata monitoring into module [puppet] - 10https://gerrit.wikimedia.org/r/162881 [13:28:48] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: Epic puppet fail [13:33:52] (03PS1) 10Yuvipanda: nagios_common: Move check_paging into module [puppet] - 10https://gerrit.wikimedia.org/r/162882 [13:34:21] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move telnet into module [puppet] - 10https://gerrit.wikimedia.org/r/162244 (owner: 10Yuvipanda) [13:35:25] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move ssh into module [puppet] - 10https://gerrit.wikimedia.org/r/162245 (owner: 10Yuvipanda) [13:36:31] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move snmp into module [puppet] - 10https://gerrit.wikimedia.org/r/162246 (owner: 10Yuvipanda) [13:42:02] INCOMING SPAM [13:42:09] andrewbogott: ^ [13:42:11] (03PS3) 10Yuvipanda: nagios_common: Move tcp_udp into module [puppet] - 10https://gerrit.wikimedia.org/r/162250 [13:42:13] (03PS3) 10Yuvipanda: nagios_common: Move apt into module [puppet] - 10https://gerrit.wikimedia.org/r/162251 [13:42:15] (03PS3) 10Yuvipanda: nagios_common: Move radius into module [puppet] - 10https://gerrit.wikimedia.org/r/162248 [13:42:17] (03PS3) 10Yuvipanda: nagios_common: Move rpc-nfs into module [puppet] - 10https://gerrit.wikimedia.org/r/162249 [13:42:19] (03PS3) 10Yuvipanda: nagios_common: Move disk-smb into module [puppet] - 10https://gerrit.wikimedia.org/r/162254 [13:42:21] (03PS3) 10Yuvipanda: nagios_common: Move disk into module [puppet] - 10https://gerrit.wikimedia.org/r/162255 [13:42:23] (03PS3) 10Yuvipanda: nagios_common: Move breeze into module [puppet] - 
10https://gerrit.wikimedia.org/r/162252 [13:42:25] (03PS3) 10Yuvipanda: nagios_common: Move dhcp into module [puppet] - 10https://gerrit.wikimedia.org/r/162253 [13:42:27] (03PS3) 10Yuvipanda: nagios_common: Move snmp into module [puppet] - 10https://gerrit.wikimedia.org/r/162246 [13:42:29] (03PS3) 10Yuvipanda: nagios_common: Move real into module [puppet] - 10https://gerrit.wikimedia.org/r/162247 [13:42:31] (03PS3) 10Yuvipanda: nagios_common: move mysql into module [puppet] - 10https://gerrit.wikimedia.org/r/162267 [13:42:33] (03PS3) 10Yuvipanda: nagios_common: move mrtg into module [puppet] - 10https://gerrit.wikimedia.org/r/162266 [13:42:35] (03PS3) 10Yuvipanda: nagios_common: Move mail into module [puppet] - 10https://gerrit.wikimedia.org/r/162265 [13:42:37] (03PS3) 10Yuvipanda: nagios_common: Move load into module [puppet] - 10https://gerrit.wikimedia.org/r/162264 [13:42:39] (03PS3) 10Yuvipanda: nagios_common: move ntp into module [puppet] - 10https://gerrit.wikimedia.org/r/162271 [13:42:41] (03PS3) 10Yuvipanda: nagios_common: move nt into module [puppet] - 10https://gerrit.wikimedia.org/r/162270 [13:42:43] (03PS3) 10Yuvipanda: nagios_common: move news into module [puppet] - 10https://gerrit.wikimedia.org/r/162269 [13:42:45] (03PS3) 10Yuvipanda: nagios_common: move netware into module [puppet] - 10https://gerrit.wikimedia.org/r/162268 [13:42:47] (03PS2) 10Yuvipanda: icinga: Move wikidata monitoring into module [puppet] - 10https://gerrit.wikimedia.org/r/162881 [13:42:49] (03PS3) 10Yuvipanda: nagios_common: Move ftp into module [puppet] - 10https://gerrit.wikimedia.org/r/162259 [13:42:51] (03PS3) 10Yuvipanda: nagios_common: Move flexlm into module [puppet] - 10https://gerrit.wikimedia.org/r/162258 [13:42:53] (03PS3) 10Yuvipanda: nagios_common: Move dummy into module [puppet] - 10https://gerrit.wikimedia.org/r/162257 [13:42:55] (03PS2) 10Yuvipanda: nagios_common: Move check_paging into module [puppet] - 10https://gerrit.wikimedia.org/r/162882 [13:42:57] (03PS3) 
10Yuvipanda: nagios_common: Move dns into module [puppet] - 10https://gerrit.wikimedia.org/r/162256 [13:42:59] (03PS3) 10Yuvipanda: nagios_common: Move ldap into module [puppet] - 10https://gerrit.wikimedia.org/r/162263 [13:43:01] (03PS3) 10Yuvipanda: nagios_common: Move ifstatus into module [puppet] - 10https://gerrit.wikimedia.org/r/162262 [13:43:03] (03PS3) 10Yuvipanda: nagios_common: Move http into module [puppet] - 10https://gerrit.wikimedia.org/r/162261 [13:43:05] (03PS3) 10Yuvipanda: icinga: Remove hppjd check [puppet] - 10https://gerrit.wikimedia.org/r/162260 [13:43:07] (03PS2) 10Yuvipanda: icinga: Remove analytics.cfg according to TEMP: message [puppet] - 10https://gerrit.wikimedia.org/r/162872 [13:43:09] (03PS3) 10Yuvipanda: icinga: Move NSCA code into module [puppet] - 10https://gerrit.wikimedia.org/r/162870 [13:43:11] (03PS3) 10Yuvipanda: nagios_common: move pgsql into module [puppet] - 10https://gerrit.wikimedia.org/r/162272 [13:43:13] (03PS3) 10Yuvipanda: nagios_common: move ping into module [puppet] - 10https://gerrit.wikimedia.org/r/162273 [13:43:15] (03PS3) 10Yuvipanda: nagios_common: move procs into module [puppet] - 10https://gerrit.wikimedia.org/r/162274 [13:43:17] (03PS3) 10Yuvipanda: nagios_common: move vsz into module [puppet] - 10https://gerrit.wikimedia.org/r/162275 [13:43:19] (03PS2) 10Yuvipanda: icinga: Move logrotate into module [puppet] - 10https://gerrit.wikimedia.org/r/162866 [13:43:21] (03PS2) 10Yuvipanda: icinga: Move user / group setup into module [puppet] - 10https://gerrit.wikimedia.org/r/162867 [13:43:23] (03PS3) 10Yuvipanda: icinga: Move icinga web into module [puppet] - 10https://gerrit.wikimedia.org/r/162865 [13:43:25] (03PS2) 10Yuvipanda: nagios_common: Move notification commands into module [puppet] - 10https://gerrit.wikimedia.org/r/162582 [13:43:27] (03PS3) 10Yuvipanda: nagios_common: Move timeperiods definition into module [puppet] - 10https://gerrit.wikimedia.org/r/162583 [13:43:31] eeeee [13:43:40] andrewbogott: that 
should be it [13:43:50] poor Jenkins [13:43:53] grrrit-wm: didn't get them in any order [13:44:00] so follow the dependency chains again :) [13:44:45] (03PS1) 10Giuseppe Lavagetto: hiera: use structured data in the private repo as well. [puppet] - 10https://gerrit.wikimedia.org/r/162883 [13:45:12] go YuviPanda go [13:45:26] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move snmp into module [puppet] - 10https://gerrit.wikimedia.org/r/162246 (owner: 10Yuvipanda) [13:45:38] <_joe_> YuviPanda: is andrew reviewing those? [13:45:44] _joe_: he is! [13:45:54] <_joe_> andrewbogott: please run puppet on icinga from time to time [13:46:01] <_joe_> to ensure nothing has been broken [13:46:08] <_joe_> godog: ^^ [13:46:09] _joe_: icinga == neon, right? [13:46:14] <_joe_> andrewbogott: yep [13:46:19] <_joe_> sorry [13:46:34] andrewbogott: can you also do find /etc/nagios-plugins and find /usr/lib/nagios/ and paste output on neon? whenever you get on that machine [13:46:44] yep, when puppet finishes [13:47:29] andrewbogott: cool [13:47:42] * YuviPanda hopes no other patches get merged in the meantime, to avoid another rebase chain spam [13:47:51] that, or I should implement some sort of quieting for grrrit-wm [13:48:09] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [13:48:35] haha, gmail has decided to put jenkins bot in my 'promotions' tab [13:51:09] YuviPanda: as you feared, we're getting a duplicate def in icinga: "Warning: Duplicate definition found for host 'labs-ns1.wikimedia.org' (config file '/etc/icinga/puppet_hosts.cfg', starting on line 4382)" [13:51:29] andrewbogott: that shouldn't be touched by any of my changes [13:51:47] andrewbogott: I was expecting duplicate defs from commands, not from host config [13:51:54] * YuviPanda hasn't touched naggen yet [13:52:21] Also, https://dpaste.de/5oxM and https://dpaste.de/niPG [13:52:45] (03CR) 10Filippo Giunchedi: [C: 031] hiera: use structured data in 
the private repo as well. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/162883 (owner: 10Giuseppe Lavagetto) [13:54:20] YuviPanda: ok, it may be my fault… I'd like to sort it out before i merge any more though, since right now icinga won't start [13:54:31] andrewbogott: yeah, makes sense. [13:54:49] If I have two different servers with the same role, would that do it? [13:55:20] andrewbogott: depends on how the @monitor_host is setup, possibly [13:55:54] My guess is that since both virt1000 and neptunium are set up for dns in eqiad... [13:55:56] I also have a solution to the dup defs caused by my changes: rm everything in /etc/nagios-plugins/config, except fping.cfg and games.cfg [13:56:04] andrewbogott: what role is this? [13:56:25] Don't know for sure yet, but probably include role::dns::ldap [13:56:49] hm, yep [13:56:55] I can just remove that, actually [13:57:28] andrewbogott: yeah, line 46 of dns.pp [13:57:57] andrewbogott: sets up monitor host with $dns_auth_soa_name, wich I guess is the same for both [14:00:06] andrewbogott: I'm going afk but should be back in 15m [14:00:40] (03PS1) 10Andrew Bogott: Don't run a dns server on neptunium [puppet] - 10https://gerrit.wikimedia.org/r/162885 [14:01:52] (03PS2) 10Andrew Bogott: Don't run a dns server on neptunium [puppet] - 10https://gerrit.wikimedia.org/r/162885 [14:03:11] (03CR) 10Andrew Bogott: [C: 032] Don't run a dns server on neptunium [puppet] - 10https://gerrit.wikimedia.org/r/162885 (owner: 10Andrew Bogott) [14:16:55] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [14:17:55] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [14:18:08] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move real into module [puppet] - 10https://gerrit.wikimedia.org/r/162247 (owner: 10Yuvipanda) [14:18:16] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move radius into module [puppet] - 
10https://gerrit.wikimedia.org/r/162248 (owner: 10Yuvipanda) [14:18:24] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move rpc-nfs into module [puppet] - 10https://gerrit.wikimedia.org/r/162249 (owner: 10Yuvipanda) [14:23:55] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 3 failures [14:24:45] PROBLEM - puppet last run on db1019 is CRITICAL: CRITICAL: Puppet has 1 failures [14:26:00] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [14:27:22] boo, so many races in icinga puppet [14:27:32] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move tcp_udp into module [puppet] - 10https://gerrit.wikimedia.org/r/162250 (owner: 10Yuvipanda) [14:27:41] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move apt into module [puppet] - 10https://gerrit.wikimedia.org/r/162251 (owner: 10Yuvipanda) [14:27:48] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move breeze into module [puppet] - 10https://gerrit.wikimedia.org/r/162252 (owner: 10Yuvipanda) [14:30:20] (03PS2) 10Giuseppe Lavagetto: mediawiki: move memcached servers list to a hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/162622 [14:31:27] andrewbogott: back [14:31:29] andrewbogott: once the 'move into module' things are all merged (last one is vsz, I think), we should rm the old config [14:32:13] YuviPanda: earlier when you said 'remove all files except for…' Can't I just remove all of them and let puppet regenerate the few that we want? [14:32:19] Or are a couple of them not puppetized? [14:32:39] andrewbogott: nope, the others don't seem to be puppetized, and also seem to be for unrelated things (I've no idea what fping is, and games.cfg sounds dubious) [14:32:47] ok [14:33:02] Remind me about that after the Great Merging is done [14:33:16] andrewbogott: yes! 
[14:34:31] andrewbogott: 24 to go in just the 'config' series, and then there's 9 more after that in the 'modularize icinga' series (which is still ongoing) [14:34:54] Great Merging indeed [14:39:35] RECOVERY - puppet last run on db1019 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [14:39:47] YuviPanda: why the dsh changes? [14:39:49] just curious [14:39:55] are the related to your icinga work? [14:40:00] ottomata: it wasn't in a module, it could be a module, so I made it into one :) [14:41:11] aye, hm, at this point though...if you are going to go that route...hiera? [14:41:17] (03PS4) 10Andrew Bogott: nagios_common: Move dhcp into module [puppet] - 10https://gerrit.wikimedia.org/r/162253 (owner: 10Yuvipanda) [14:41:20] especially for somethign like this, where most of the module is config [14:41:34] (03PS4) 10Andrew Bogott: nagios_common: Move disk-smb into module [puppet] - 10https://gerrit.wikimedia.org/r/162254 (owner: 10Yuvipanda) [14:41:37] ottomata: they're config files, tho? 
[14:41:45] (03PS4) 10Andrew Bogott: nagios_common: Move disk into module [puppet] - 10https://gerrit.wikimedia.org/r/162255 (owner: 10Yuvipanda) [14:41:49] (03PS4) 10Andrew Bogott: nagios_common: Move dns into module [puppet] - 10https://gerrit.wikimedia.org/r/162256 (owner: 10Yuvipanda) [14:41:55] (03PS4) 10Andrew Bogott: nagios_common: Move dummy into module [puppet] - 10https://gerrit.wikimedia.org/r/162257 (owner: 10Yuvipanda) [14:42:01] (03PS4) 10Andrew Bogott: nagios_common: Move flexlm into module [puppet] - 10https://gerrit.wikimedia.org/r/162258 (owner: 10Yuvipanda) [14:42:07] (03PS4) 10Andrew Bogott: nagios_common: Move ftp into module [puppet] - 10https://gerrit.wikimedia.org/r/162259 (owner: 10Yuvipanda) [14:42:22] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move dhcp into module [puppet] - 10https://gerrit.wikimedia.org/r/162253 (owner: 10Yuvipanda) [14:42:27] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move disk-smb into module [puppet] - 10https://gerrit.wikimedia.org/r/162254 (owner: 10Yuvipanda) [14:42:33] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move disk into module [puppet] - 10https://gerrit.wikimedia.org/r/162255 (owner: 10Yuvipanda) [14:42:38] bblack: what's the range for internal lvs ip in codfw I could use? I'd need one internal address for swift frontend [14:42:51] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move dns into module [puppet] - 10https://gerrit.wikimedia.org/r/162256 (owner: 10Yuvipanda) [14:43:03] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move dummy into module [puppet] - 10https://gerrit.wikimedia.org/r/162257 (owner: 10Yuvipanda) [14:43:06] i suppose, ja, but, if they were yaml....they coudl be used everywhere! 
[14:43:10] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move flexlm into module [puppet] - 10https://gerrit.wikimedia.org/r/162258 (owner: 10Yuvipanda) [14:43:14] and realized without having to manually edit the module [14:43:15] anyyyyway [14:43:24] also, it looks like the dsh_groups define is not being used anywhere? [14:43:31] check_dsh_groups? [14:43:34] ottomata: indeed, it isn't, which is confusing, but it was there before... [14:43:49] ottomata: check_dsh_groups is a icinga command, and is setup for betalabs, but isn't using this, but betalabs has its own dsh config anyway [14:44:08] YuviPanda: i understand the motivation to move stuff to a module, but I almost think we shouldn't bother doing straight module imports from manifests like this [14:44:18] i think it would be more benificial to refactor when creating a module [14:44:23] rather than just moving everything into a new directory [14:44:25] godog: I don't think anyone assigned one yet, unless it's in RT. The obvious options would be 10.2.5.0/24 (after ulsfo) or 10.2.1.0/24 (steal the unused one from pmtpa). Let me check RT in case mark did assign. [14:44:38] maybe others disagree? not sure. [14:44:42] ottomata: true, but I didn't really find much ways to refactor, since this was already so simple. [14:44:54] ottomata: config files could be in hiera if they're all YAML, but they aren't... [14:45:08] so I could theoretically put them in YAML, and then generate the config from them... 
[14:45:10] (03PS4) 10Andrew Bogott: icinga: Remove hppjd check [puppet] - 10https://gerrit.wikimedia.org/r/162260 (owner: 10Yuvipanda) [14:45:14] i guess, the manually maintained dsh groups isn't ideal, especially if that informatino could be gleaned from things system::role [14:45:15] or something [14:45:18] (03PS4) 10Andrew Bogott: nagios_common: Move http into module [puppet] - 10https://gerrit.wikimedia.org/r/162261 (owner: 10Yuvipanda) [14:45:25] (03PS4) 10Andrew Bogott: nagios_common: Move ifstatus into module [puppet] - 10https://gerrit.wikimedia.org/r/162262 (owner: 10Yuvipanda) [14:45:25] ottomata: but dsh is fairly central/important, and I don't want to do it in one big patch. [14:45:28] bblack: cool, thanks for checking [14:45:28] there could be some kind of larger metadata that would automatically generate dsh groups [14:45:31] * YuviPanda likes multiple smaller patches for refactoring [14:45:31] (03PS4) 10Andrew Bogott: nagios_common: Move ldap into module [puppet] - 10https://gerrit.wikimedia.org/r/162263 (owner: 10Yuvipanda) [14:45:36] (03PS4) 10Andrew Bogott: nagios_common: Move load into module [puppet] - 10https://gerrit.wikimedia.org/r/162264 (owner: 10Yuvipanda) [14:45:42] (03PS4) 10Andrew Bogott: nagios_common: Move mail into module [puppet] - 10https://gerrit.wikimedia.org/r/162265 (owner: 10Yuvipanda) [14:45:48] (03PS4) 10Andrew Bogott: nagios_common: move mrtg into module [puppet] - 10https://gerrit.wikimedia.org/r/162266 (owner: 10Yuvipanda) [14:45:52] haha, YuviPanda agreed, hence why make this change like this at all? if there is no motivation behind it other than "let's put it in a different directory"? [14:45:57] (03PS2) 10Giuseppe Lavagetto: hiera: use structured data in the private repo as well. [puppet] - 10https://gerrit.wikimedia.org/r/162883 [14:46:07] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] hiera: use structured data in the private repo as well. 
[puppet] - 10https://gerrit.wikimedia.org/r/162883 (owner: 10Giuseppe Lavagetto) [14:46:13] ottomata: because I don't like leaving things in manifests/ or files/ :) [14:46:29] this will also make future refactors easier [14:46:37] godog: I don't see anything in the RT ticket. IMHO let's steal pmtpa range (edit template/10.in-addr.arpa, fix up: [14:46:40] ; pmtpa svc ips [14:46:40] <_joe_> there is a lot of difference [14:46:42] $ORIGIN 1.2.{{ zonename }}. [14:47:00] i dunno, i disagree, if you had a new WIP refactored module, you'd be likely able to have the old one still in place while you use the new one [14:47:26] ottomata: nah, in that case you end up with both. see also: ganglia_new :) [14:47:29] <_joe_> ottomata: because that works so well... [14:47:58] one small, easily verifyable step at a time is the way to go, I think [14:48:03] <_joe_> Yuvi is first of all modularizing a very annoyingly monolithic set of manifests [14:48:14] _joe_: we're talking about dsh, not icinga [14:48:26] <_joe_> YuviPanda: oh [14:48:38] _joe_: which is a smaller set of things, but still spread out across manifests/ files/ [14:49:05] ja, i just don't see the benifit of creating a module just by moving files around, i think the refactor should be taken into account. Yuvi thinks this is a first step in a refactor, but meehhhhh [14:49:07] _joe_: ottomata I'm pretty sure 'big bang refactoring' wouldn't work well for icinga *at all*. [14:49:16] agree. [14:49:19] i like what we are doing with icinga [14:49:21] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 6 failures [14:49:30] but, we are not doing a direct mv file thing for your icinga patches either [14:49:33] which is what this dsh patch is [14:49:35] <_joe_> looks like you managed to break it [14:49:40] <_joe_> icinga I mean [14:49:45] andrewbogott: ^ can you tell me what broke? 
[14:49:52] <_joe_> I have no time for it now [14:49:56] ottomata: true, but as I said, one step at a time :) I think it marginally improves things [14:50:08] _joe_: yeah, am working with andrewbogott for now. [14:50:15] YuviPanda, _joe_, there's just a race, it always takes two runs to apply YuviPanda's changes [14:50:18] bblack: ack, I'll send a code review [14:50:22] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [14:50:24] Nothing to worry about [14:50:25] ah lol [14:50:36] <_joe_> andrewbogott: NO [14:50:52] <_joe_> andrewbogott: icinga -v /etc/icinga/icinga.cfg [14:51:29] ? [14:51:46] The errors look like this: Error: /Stage[main]/Icinga::Monitor::Files::Nagios-plugins/File[/etc/nagios-plugins/config/flexlm.cfg]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///files/icinga/plugin-config/flexlm.cfg [14:52:08] So, it tries to read the file from its new location before it is moved… [14:52:14] uh, that's super weird, cine that puppet:/// url shouldn't be referenced anywhere. [14:52:53] it's trying to read them from the old location, I think [14:53:01] <_joe_> andrewbogott: do you ever look at the output from puppet-merge? [14:53:22] <_joe_> maybe some changes were not merged on strontium... [14:53:52] hm, I don't think that's it. [14:53:55] hmm, if they weren't merged, then the second run shouldn't succeed? [14:54:11] also the changes were atomic per-config file, so one change merging and another not shouldn't change much... [14:55:15] So… if you look at the error: The reference is in an icinga config file, not in a puppet manifest. [14:55:31] So, at the end of the first run that icinga conf is updated, after which the file path is correct [14:55:41] no, as to why an icinga conf has a reference to a puppet:/// url... 
[14:55:59] * YuviPanda is fairly confused atm [14:55:59] (03CR) 10Ottomata: "I don't think it is worth it to just move files directly from manifests/ or files/ into module. This module could have some refactoring t" [puppet] - 10https://gerrit.wikimedia.org/r/162570 (owner: 10Yuvipanda) [14:56:18] hm, me too [14:56:27] (03CR) 10Yuvipanda: ":) I think it's a small incremental improvement, with at worst 0 cos to merging." [puppet] - 10https://gerrit.wikimedia.org/r/162570 (owner: 10Yuvipanda) [14:57:15] ok, what I just said is clearly not right [14:57:35] andrewbogott: did you run icinga -v /etc/icinga/icinga.cfg? I think it is supposed to report any individiual config errors [14:59:25] Anyone have anything for SWAT? [15:00:04] Going, going... [15:00:04] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140925T1500). Please do the needful. [15:00:13] I'll go with gone. [15:00:32] https://wikitech.wikimedia.org/wiki/Deployments#Thursday.2C.C2.A0September.C2.A025 shows nothing for this slot [15:00:34] (03PS1) 10Filippo Giunchedi: codfw: steal pmtpa svc range [dns] - 10https://gerrit.wikimedia.org/r/162887 [15:00:38] SWAT is declared closed. [15:00:48] YuviPanda: there are some duplicate defs but it's not clear to me that that's related to the puppet failure [15:01:10] andrewbogott: can you pastebin them? [15:01:19] andrewbogott: I am pretty sure those are the rm-worthy things I was talking about [15:01:34] https://dpaste.de/6sZW [15:01:53] (03PS2) 10Filippo Giunchedi: codfw: steal pmtpa svc range [dns] - 10https://gerrit.wikimedia.org/r/162887 [15:01:53] andrewbogott: yeah, that's fixed by rming things. 
[15:01:58] ok [15:02:12] bblack: ^ [15:02:18] I'm going to merge one more change, and then WAIT a minute and see if that's the only issue with the missing file [15:02:27] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move ftp into module [puppet] - 10https://gerrit.wikimedia.org/r/162259 (owner: 10Yuvipanda) [15:02:30] andrewbogott: ok [15:02:52] because maybe I've just been refreshing puppet before all the masters were caught up [15:05:55] (03CR) 10BBlack: "IMHO, on the wmnet side of things, move it to the bottom of the file (and replace the pmtpa comment/origin there)." [dns] - 10https://gerrit.wikimedia.org/r/162887 (owner: 10Filippo Giunchedi) [15:06:26] heya andrewbogott, do you know anything about silver? [15:06:31] what is it being used for? [15:06:56] http://en.wikipedia.org/wiki/Silver#Applications [15:07:02] haah [15:07:22] wow! [15:07:25] ottomata: I don't know offhand. It looks like there's some mobile stuff running there [15:07:35] that is a very versatile server! [15:07:41] marktraceur, no chance I can get a last minute config change in? [15:07:43] that stuff is about to be removed [15:07:52] aside from that I see it has ldap stuff on it [15:07:57] ldap-admins group is there [15:08:06] It's in the Deployments page, but I can take it out if it's definitely out. [15:08:22] It's just a one-liner in CommonSettings.php to extend an A/B test bucketing period. [15:08:30] superm401: D'you want to do it? [15:08:37] marktraceur, sure, I will. 
[15:08:38] hah, YuviPanda, _joe_, it looks like the 'race' involved a race between my typing and the strontium merge [15:08:42] I give you my blessing [15:08:46] andrewbogott: haha [15:08:46] Thanks [15:08:52] I'll watch closely-is [15:08:53] h [15:09:34] (03CR) 10Andrew Bogott: [C: 032] icinga: Remove hppjd check [puppet] - 10https://gerrit.wikimedia.org/r/162260 (owner: 10Yuvipanda) [15:09:43] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move http into module [puppet] - 10https://gerrit.wikimedia.org/r/162261 (owner: 10Yuvipanda) [15:09:49] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move ifstatus into module [puppet] - 10https://gerrit.wikimedia.org/r/162262 (owner: 10Yuvipanda) [15:09:59] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move ldap into module [puppet] - 10https://gerrit.wikimedia.org/r/162263 (owner: 10Yuvipanda) [15:10:06] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move load into module [puppet] - 10https://gerrit.wikimedia.org/r/162264 (owner: 10Yuvipanda) [15:10:12] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move mail into module [puppet] - 10https://gerrit.wikimedia.org/r/162265 (owner: 10Yuvipanda) [15:11:22] (03PS3) 10Filippo Giunchedi: codfw: steal pmtpa svc range [dns] - 10https://gerrit.wikimedia.org/r/162887 [15:11:51] bblack: indeed at the bottom is where it belongs [15:11:55] greg-g: I don't see anything about punctuality in the SWAT rules but I feel it might be a nice addition, FYI [15:11:57] (03CR) 10Mattflaschen: [C: 032] "Approved to go today" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162806 (owner: 10Mattflaschen) [15:11:59] (/r/nocontext) [15:12:07] Apart from devs being available [15:12:17] :) [15:12:34] godog: Thank you for the leading slash [15:12:50] <_joe_> ori: you're cited at the top of /r/lolphp ! [15:12:52] I hate people who refer to subreddits like "r/iama" [15:13:28] (03Merged) 10jenkins-bot: Extend GettingStarted bucketting period to Sept. 
28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162806 (owner: 10Mattflaschen) [15:13:30] marktraceur: you're welcome :) been there, done that [15:13:40] godog: + "27 PTR ...." [15:14:19] <_joe_> in fact "The root of the problem is that HHVM implements Memcached::increment/decrement (and Memcached::incrementByKey / Memcached::decrementByKey) as they are documented, rather than how the PHP5 implementation actually behaves." is quite quotable for that subreddit :) [15:14:36] (03PS5) 10Andrew Bogott: nagios_common: move mrtg into module [puppet] - 10https://gerrit.wikimedia.org/r/162266 (owner: 10Yuvipanda) [15:14:55] heya Reedy, yt? [15:15:05] (03PS4) 10Andrew Bogott: nagios_common: move mysql into module [puppet] - 10https://gerrit.wikimedia.org/r/162267 (owner: 10Yuvipanda) [15:15:07] ottomata: ja [15:15:10] only 10 left (in that series) [15:15:12] (03PS4) 10Andrew Bogott: nagios_common: move netware into module [puppet] - 10https://gerrit.wikimedia.org/r/162268 (owner: 10Yuvipanda) [15:15:16] any clues as to what silver is used for? [15:15:17] (03PS4) 10Andrew Bogott: nagios_common: move news into module [puppet] - 10https://gerrit.wikimedia.org/r/162269 (owner: 10Yuvipanda) [15:15:24] (03PS4) 10Andrew Bogott: nagios_common: move nt into module [puppet] - 10https://gerrit.wikimedia.org/r/162270 (owner: 10Yuvipanda) [15:15:27] i see you logged in there yesterday, that's why i'm asking you [15:15:28] :p [15:15:29] (03PS4) 10Andrew Bogott: nagios_common: move ntp into module [puppet] - 10https://gerrit.wikimedia.org/r/162271 (owner: 10Yuvipanda) [15:15:35] (03PS4) 10Andrew Bogott: nagios_common: move pgsql into module [puppet] - 10https://gerrit.wikimedia.org/r/162272 (owner: 10Yuvipanda) [15:15:37] ottomata: haha [15:15:40] <_joe_> andrewbogott, YuviPanda can't you just rebase them one at a time PLEASE? 
[15:15:42] ottomata: it's used for ldap stuffs [15:15:55] _joe_: that's what andrewbogott is doing, 5 at a time, instead of me doing them 20 at a time [15:16:03] bblack: thanks for the hand-holding :) are drive-by indentation fixes acceptable too? [15:16:03] <_joe_> ouch [15:16:04] ottomata: or at least, that's why I logged in (to test if I could do ldap stuff), which I can, if I sudo COMMAND [15:16:24] aye ok, hm. [15:16:30] godog: personally, I tend to think it's better to save whitespace fixups for separate commits and just live with ugliness in functional commits [15:16:38] ottomata: site.pp mentions only the vumi stuff, along with ldap [15:16:46] so, i'm about to remove the mobile vumi stuff there form one of Yuvi's patches, but i'm just wondering what else this server is really used for...i guess a place for folks to run ldap commands [15:16:47] yeah [15:16:51] marktraceur: so, given /r/nocontext, help? :) [15:17:00] YuviPanda: your patch doesn't use puppet to remove anything, so i'm going to have to do so manually... 
[15:17:03] unless we decom this server :) [15:17:05] bblack: yup [15:17:15] YuviPanda: goddamnit [15:17:28] ottomata: yeah, figured this was easier :) I think we can even move the ldap stuff into some other host at some point and decomm this server [15:17:38] ottomata: I can email ops@ if you want [15:18:12] (03CR) 10Andrew Bogott: [C: 032] nagios_common: move mrtg into module [puppet] - 10https://gerrit.wikimedia.org/r/162266 (owner: 10Yuvipanda) [15:18:14] (03PS4) 10Filippo Giunchedi: codfw: steal pmtpa svc range [dns] - 10https://gerrit.wikimedia.org/r/162887 [15:18:21] (03CR) 10Andrew Bogott: [C: 032] nagios_common: move mysql into module [puppet] - 10https://gerrit.wikimedia.org/r/162267 (owner: 10Yuvipanda) [15:18:31] YuviPanda: that would be helpful [15:18:35] (03CR) 10Andrew Bogott: [C: 032] nagios_common: move netware into module [puppet] - 10https://gerrit.wikimedia.org/r/162268 (owner: 10Yuvipanda) [15:18:37] ottomata: cool, doing [15:18:46] ottomata: want to merge and cleanup in the meantime, or should we wait? [15:19:28] !log mattflaschen Synchronized wmf-config/CommonSettings.php: Extend GettingStarted bucketting period end date to Sept. 28 (duration: 00m 07s) [15:19:34] Logged the message, Master [15:20:04] Thanks, marktraceur [15:20:41] YuviPanda: well, if we can decom this server...then I don't have to clean up :D [15:20:47] so...wait!
[15:20:49] ottomata: yeah, makes sense :) [15:20:51] ottomata: emailing now [15:21:30] ottomata: It looks fairly high spec [15:21:33] (03CR) 10BBlack: [C: 031] codfw: steal pmtpa svc range [dns] - 10https://gerrit.wikimedia.org/r/162887 (owner: 10Filippo Giunchedi) [15:21:45] ottomata: done [15:21:52] 16 processors, so I'm guessing it's 8 + HT [15:21:55] (03PS6) 10BBlack: NTP config refactoring + updates [puppet] - 10https://gerrit.wikimedia.org/r/162625 [15:22:04] maybe it was provisioned as a vumi server when their old tampa sever was being threatened by tampa shutdown [15:22:11] that looks like what happened in the RT ticket [15:22:15] (thanks) [15:22:24] YuviPanda: wanna talk nagios_common naming? [15:22:28] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 3 failures [15:22:30] ottomata: hahah :D [15:23:01] (03PS3) 10Glaisher: Add wikidatawiki to wgAppleTouchIcon and add wikidata.png to bits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162638 (https://bugzilla.wikimedia.org/70996) [15:23:52] Reedy: or maybe ldap client requires that many cores to run? :) [15:23:59] (03CR) 10Glaisher: "oops" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162638 (https://bugzilla.wikimedia.org/70996) (owner: 10Glaisher) [15:24:01] (03PS7) 10BBlack: NTP config refactoring + updates [puppet] - 10https://gerrit.wikimedia.org/r/162625 [15:24:27] YuviPanda: f*** ldap [15:24:37] (03CR) 10BBlack: [C: 032] Move partial traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162748 (owner: 10BBlack) [15:24:38] Reedy: 'tis ok, you can say 'fuck' on the internet :) [15:25:03] we are the engineers that say fsck! 
[15:25:10] hah [15:25:22] I think you get to say fsck after you've had to actually rescue some valuble data with fsck [15:25:25] hasn't happened to me yet [15:25:27] (03PS1) 10Giuseppe Lavagetto: lvs: add hhvm-api.svc.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/162894 [15:25:30] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:25:54] !seen akosiaris [15:26:06] kart_: He's here now.. ;) [15:26:09] (03CR) 10Andrew Bogott: [C: 032] nagios_common: move news into module [puppet] - 10https://gerrit.wikimedia.org/r/162269 (owner: 10Yuvipanda) [15:26:14] (03CR) 10Andrew Bogott: [C: 032] nagios_common: move nt into module [puppet] - 10https://gerrit.wikimedia.org/r/162270 (owner: 10Yuvipanda) [15:26:55] (03PS4) 10Andrew Bogott: nagios_common: move ping into module [puppet] - 10https://gerrit.wikimedia.org/r/162273 (owner: 10Yuvipanda) [15:27:01] (03PS4) 10Andrew Bogott: nagios_common: move procs into module [puppet] - 10https://gerrit.wikimedia.org/r/162274 (owner: 10Yuvipanda) [15:27:06] (03PS4) 10Andrew Bogott: nagios_common: move vsz into module [puppet] - 10https://gerrit.wikimedia.org/r/162275 (owner: 10Yuvipanda) [15:27:13] (03PS3) 10Andrew Bogott: nagios_common: Move notification commands into module [puppet] - 10https://gerrit.wikimedia.org/r/162582 (owner: 10Yuvipanda) [15:27:49] (03CR) 10Andrew Bogott: [C: 032] nagios_common: move pgsql into module [puppet] - 10https://gerrit.wikimedia.org/r/162272 (owner: 10Yuvipanda) [15:27:51] (03PS2) 10Giuseppe Lavagetto: lvs: add hhvm-api.svc.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/162894 [15:27:56] (03CR) 10Andrew Bogott: [C: 032] nagios_common: move ping into module [puppet] - 10https://gerrit.wikimedia.org/r/162273 (owner: 10Yuvipanda) [15:28:02] (03CR) 10Andrew Bogott: [C: 032] nagios_common: move procs into module [puppet] - 10https://gerrit.wikimedia.org/r/162274 (owner: 10Yuvipanda) [15:28:12] (03CR) 10Andrew Bogott: [C: 
032] nagios_common: move vsz into module [puppet] - 10https://gerrit.wikimedia.org/r/162275 (owner: 10Yuvipanda) [15:29:22] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move notification commands into module [puppet] - 10https://gerrit.wikimedia.org/r/162582 (owner: 10Yuvipanda) [15:29:38] (03PS4) 10Andrew Bogott: nagios_common: Move timeperiods definition into module [puppet] - 10https://gerrit.wikimedia.org/r/162583 (owner: 10Yuvipanda) [15:29:42] (03PS1) 10Giuseppe Lavagetto: add hhvm-api.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/162896 [15:29:55] (03PS4) 10Andrew Bogott: icinga: Move icinga web into module [puppet] - 10https://gerrit.wikimedia.org/r/162865 (owner: 10Yuvipanda) [15:30:01] (03PS3) 10Andrew Bogott: icinga: Move logrotate into module [puppet] - 10https://gerrit.wikimedia.org/r/162866 (owner: 10Yuvipanda) [15:30:12] andrewbogott: we should probably stop now, force a run, clear out all warnings before continuing [15:30:25] ok [15:31:56] !log testing ntpd changes on acamar, achernar, chromium, hydrogen, nescio, and baham (puppet-agent disabled) [15:32:00] Logged the message, Master [15:32:12] (03PS8) 10BBlack: NTP config refactoring + updates [puppet] - 10https://gerrit.wikimedia.org/r/162625 [15:32:19] (03CR) 10BBlack: [C: 032 V: 032] NTP config refactoring + updates [puppet] - 10https://gerrit.wikimedia.org/r/162625 (owner: 10BBlack) [15:33:25] (03CR) 10JanZerebecki: [C: 04-1] "The approach is good. See inline comments." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/162873 (owner: 10Christopher Johnson (WMDE)) [15:37:13] (03PS1) 10BBlack: Fix for new ntp.conf template [puppet] - 10https://gerrit.wikimedia.org/r/162898 [15:37:25] (03CR) 10BBlack: [C: 032 V: 032] Fix for new ntp.conf template [puppet] - 10https://gerrit.wikimedia.org/r/162898 (owner: 10BBlack) [15:37:49] YuviPanda: after a puppet run would you expect /etc/nagios-plugins/config to be mostly empty? 
[15:38:06] andrewbogott: no, since we didn't setup ensure => absent [15:38:19] I mean, after I rm all the files you wanted me to rm [15:38:23] andrewbogott: so 1. we remove the files there (except for fping.cfg and games.cfg), and then after a puppet run, they shouldn't exist [15:38:31] ok, great, that's what i'm seeing [15:38:31] andrewbogott: oh, yeah, should be. they should all be in /etc/icinga/commands [15:38:33] andrewbogott: cool [15:38:52] andrewbogott: icinga -v /etc/icinga/icinga.cfg shows errors? [15:39:51] Yes, many things like Error: Service notification command 'notify-by-sms-gateway' specified for contact 'akosiaris' is not defined anywhere! [15:40:34] andrewbogott: does /etc/icinga/commands/notifycommands.cfg exist? [15:40:50] yes [15:42:07] andrewbogott: is there a line about misccomands.cfg in /etc/icinga/icinga.cfg? [15:42:23] (there shouldn't be) [15:42:31] Also still lots of these: Warning: Duplicate definition found for service 'HTTPS' on host 'cp1008' (config file '/etc/icinga/puppet_services.cfg', starting on line 18827) [15:42:46] andrewbogott: can you pastebin the entire thing? [15:42:59] re: misccommands, I don't see it. [15:43:07] andrewbogott: hmm, ok. [15:43:34] https://dpaste.de/ND4D [15:44:25] hmm [15:44:28] it does find Processing object config file '/etc/icinga/commands/notifycommands.cfg'... [15:45:10] andrewbogott: /etc/icinga/commands/notifycommands.cfg isn't empty, right? [15:45:23] bam [15:45:27] I think it's probably empty [15:45:32] it is empty [15:45:40] andrewbogott: fix coming [15:45:43] ok [15:47:17] (03PS1) 10Yuvipanda: nagios_common: Temp. move notify commands into check_commands [puppet] - 10https://gerrit.wikimedia.org/r/162899 [15:47:18] andrewbogott: ^ [15:47:50] (03PS2) 10Andrew Bogott: nagios_common: Temp. 
move notify commands into check_commands [puppet] - 10https://gerrit.wikimedia.org/r/162899 (owner: 10Yuvipanda) [15:48:59] PROBLEM - NTP on acamar is CRITICAL: NTP CRITICAL: No response from NTP server [15:49:05] (03PS1) 10BBlack: fix achernar v6 revdns [dns] - 10https://gerrit.wikimedia.org/r/162901 [15:49:42] andrewbogott: I'll refator that into nagios_common::notificationcommands or something soon. [15:49:48] andrewbogott: didn't want to mess with the train [15:50:06] (03PS1) 10Giuseppe Lavagetto: hhvm: serve API as well [puppet] - 10https://gerrit.wikimedia.org/r/162902 [15:50:14] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Temp. move notify commands into check_commands [puppet] - 10https://gerrit.wikimedia.org/r/162899 (owner: 10Yuvipanda) [15:50:50] andrewbogott: I dunno about the dups for cp*** host, seems unrelated [15:51:20] yeah [15:51:25] Weird that it's just for that one host [15:51:37] andrewbogott: aha! it's the one chasemp_ was experimenting with SNI on [15:51:40] so I suspect that's the cause [15:52:36] (03CR) 10BBlack: [C: 032] fix achernar v6 revdns [dns] - 10https://gerrit.wikimedia.org/r/162901 (owner: 10BBlack) [15:52:47] # FIXME: Icinga monitoring with support for SNI [15:52:48] heh [15:53:09] andrewbogott: so pretty sure that's what's happening [15:53:40] ok, so we'll ignore for now [15:54:04] andrewbogott: yeah, I'll poke chasemp_ when he's around [15:54:47] andrewbogott: 8 more? 
:) [15:54:55] lemme make sure we're stable first [15:55:08] andrewbogott: alright :) [15:55:31] PROBLEM - NTP on chromium is CRITICAL: NTP CRITICAL: No response from NTP server [15:55:51] PROBLEM - NTP on hydrogen is CRITICAL: NTP CRITICAL: No response from NTP server [15:56:00] PROBLEM - NTP on achernar is CRITICAL: NTP CRITICAL: No response from NTP server [15:56:11] PROBLEM - NTP on nescio is CRITICAL: NTP CRITICAL: No response from NTP server [15:59:19] YuviPanda: /etc/icinga/commands/notifycommands.cfg is still empty [15:59:56] andrewbogott: ugh, that's weird. [16:00:03] ^ ignore the NTP errors, just me [16:01:59] (03PS4) 10Glaisher: Add wikidatawiki to wgAppleTouchIcon and add wikidata.png to bits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162638 (https://bugzilla.wikimedia.org/70996) [16:04:20] (03CR) 10Glaisher: "Sorry for the mess. Still learning how to use git. Is the image better now?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162638 (https://bugzilla.wikimedia.org/70996) (owner: 10Glaisher) [16:09:39] (03PS5) 10Yuvipanda: nagios_common: Move timeperiods definition into module [puppet] - 10https://gerrit.wikimedia.org/r/162583 [16:09:41] (03PS3) 10Yuvipanda: icinga: Remove analytics.cfg according to TEMP: message [puppet] - 10https://gerrit.wikimedia.org/r/162872 [16:09:43] (03PS3) 10Yuvipanda: icinga: Move wikidata monitoring into module [puppet] - 10https://gerrit.wikimedia.org/r/162881 [16:09:43] INCOMING SPAM (not much) [16:09:45] (03PS4) 10Yuvipanda: icinga: Move NSCA code into module [puppet] - 10https://gerrit.wikimedia.org/r/162870 [16:09:47] (03PS3) 10Yuvipanda: nagios_common: Move check_paging into module [puppet] - 10https://gerrit.wikimedia.org/r/162882 [16:09:49] (03PS4) 10Yuvipanda: icinga: Move logrotate into module [puppet] - 10https://gerrit.wikimedia.org/r/162866 [16:09:51] (03PS3) 10Yuvipanda: icinga: Move user / group setup into module [puppet] - 10https://gerrit.wikimedia.org/r/162867 [16:09:53] (03PS5) 
10Yuvipanda: icinga: Move icinga web into module [puppet] - 10https://gerrit.wikimedia.org/r/162865 [16:09:55] (03PS1) 10Yuvipanda: nagios_common: Move notification_commands into own class [puppet] - 10https://gerrit.wikimedia.org/r/162905 [16:11:27] YuviPanda: so… can you explain? [16:11:41] (03PS2) 10Yuvipanda: nagios_common: Move notification_commands into own class [puppet] - 10https://gerrit.wikimedia.org/r/162905 [16:11:45] (03PS1) 10Andrew Bogott: Temporarily add the ldap-codfw cert to neptunium. [puppet] - 10https://gerrit.wikimedia.org/r/162906 [16:11:49] andrewbogott: I don't see why puppet doesn't put the file there, but I've refactored it out anyway now [16:12:07] ok -- so what patch do we need to make things start to work again? [16:12:36] andrewbogott: https://gerrit.wikimedia.org/r/#/c/162583/5 and https://gerrit.wikimedia.org/r/#/c/162905/ [16:12:43] (03PS3) 10Alexandros Kosiaris: openldap module [puppet] - 10https://gerrit.wikimedia.org/r/156322 [16:12:44] two because that was the easiest rebase I could do [16:13:39] why does that have a shinken change in it? [16:13:56] oh, because it shares the contact info [16:14:06] andrewbogott: timeperiods are shared, yeah [16:14:07] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move timeperiods definition into module [puppet] - 10https://gerrit.wikimedia.org/r/162583 (owner: 10Yuvipanda) [16:14:14] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move notification_commands into own class [puppet] - 10https://gerrit.wikimedia.org/r/162905 (owner: 10Yuvipanda) [16:14:35] (03PS2) 10Andrew Bogott: Temporarily add the ldap-codfw cert to neptunium. [puppet] - 10https://gerrit.wikimedia.org/r/162906 [16:15:32] (03CR) 10Andrew Bogott: [C: 032] Temporarily add the ldap-codfw cert to neptunium. 
[puppet] - 10https://gerrit.wikimedia.org/r/162906 (owner: 10Andrew Bogott) [16:15:44] (03PS2) 10Giuseppe Lavagetto: HHVM: update JIT settings [puppet] - 10https://gerrit.wikimedia.org/r/162839 (owner: 10Ori.livneh) [16:16:38] (03PS3) 10Giuseppe Lavagetto: HHVM: update JIT settings [puppet] - 10https://gerrit.wikimedia.org/r/162839 (owner: 10Ori.livneh) [16:18:39] YuviPanda: icinga seems happy again [16:18:51] Weirdly it reports all those duplicate def warnigns, and then at the end says '0 warnings' [16:19:00] andrewbogott: hah :D [16:19:13] ok, so, to finish off this patchset… [16:19:25] I've lost the thread, where to start? [16:19:25] _joe_: morning! awesome stuff, just catching up [16:19:44] (03PS4) 10Yuvipanda: icinga: Remove analytics.cfg according to TEMP: message [puppet] - 10https://gerrit.wikimedia.org/r/162872 [16:19:46] (03PS4) 10Yuvipanda: icinga: Move wikidata monitoring into module [puppet] - 10https://gerrit.wikimedia.org/r/162881 [16:19:46] andrewbogott: let me find. [16:19:48] (03PS5) 10Yuvipanda: icinga: Move NSCA code into module [puppet] - 10https://gerrit.wikimedia.org/r/162870 [16:19:49] andrewbogott: did a rebase [16:19:50] (03PS4) 10Yuvipanda: nagios_common: Move check_paging into module [puppet] - 10https://gerrit.wikimedia.org/r/162882 [16:19:52] (03PS5) 10Yuvipanda: icinga: Move logrotate into module [puppet] - 10https://gerrit.wikimedia.org/r/162866 [16:19:54] (03PS4) 10Yuvipanda: icinga: Move user / group setup into module [puppet] - 10https://gerrit.wikimedia.org/r/162867 [16:19:56] (03PS6) 10Yuvipanda: icinga: Move icinga web into module [puppet] - 10https://gerrit.wikimedia.org/r/162865 [16:20:10] andrewbogott: https://gerrit.wikimedia.org/r/#/c/162865/ [16:20:24] <_joe_> ori: so, you're at the top of my preferred subreddit :P [16:20:38] /r/morbidreality? [16:20:44] /r/phpsucks [16:20:44] <_joe_> /r/lolphp [16:20:53] _joe_: RESOLVED DUPLICATE [16:21:26] hahaha [16:21:32] ! 
[16:21:45] (03CR) 10Andrew Bogott: [C: 032] icinga: Move icinga web into module [puppet] - 10https://gerrit.wikimedia.org/r/162865 (owner: 10Yuvipanda) [16:21:49] "OP delivers" [16:22:30] <_joe_> ori: and I've just added a small bit to your patch, https://gerrit.wikimedia.org/r/162839, which would select the present version of the package [16:22:39] <_joe_> but I gotta get off for some time now :) [16:22:44] and this is why we should stop naming classes after their protocol without namespacing [16:22:46] <_joe_> I'll be back later though [16:22:49] _joe_: can i merge it? [16:22:54] prod is broken without it [16:22:55] <_joe_> ori: feel free [16:22:55] forever and ever, the Memcached class name is wrong and wont be fixed :P [16:23:00] _joe_: awesome, thanks [16:23:02] (03CR) 10Dzahn: "i understand you don't wanna introduce new changes while moving it but fwiw: including the old webserver:: class should not be needed anym" [puppet] - 10https://gerrit.wikimedia.org/r/162865 (owner: 10Yuvipanda) [16:23:03] <_joe_> ori: prod is not right now [16:23:12] <_joe_> because we're on the old package [16:23:32] (03CR) 10Yuvipanda: "Ah, cool. Yeah, I can investigate and submit a follow up patch. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/162865 (owner: 10Yuvipanda) [16:23:33] _joe_: nah i mean prod is broken because of the bugs [16:23:52] <_joe_> ori: also, IDK about the change to the extensions path [16:23:57] <_joe_> we need that here as well [16:24:06] <_joe_> it's on beta for sure [16:24:08] _joe_: that's done already [16:24:15] <_joe_> in prod as well? [16:24:21] yeah, i did that [16:24:23] <_joe_> so maybe prod is broken? [16:24:33] <_joe_> given we still have the old package there, right? 
[16:24:35] no, i copied over the old ext dir there [16:24:39] to the new ext folder [16:24:48] i swear we agreed to do this :P [16:24:49] <_joe_> right now I remember [16:24:50] <_joe_> sorry [16:24:52] <_joe_> yes [16:24:58] <_joe_> I told you [16:25:01] <_joe_> I need a break [16:25:07] go! :P [16:26:04] (03CR) 10Andrew Bogott: [C: 032] icinga: Move logrotate into module [puppet] - 10https://gerrit.wikimedia.org/r/162866 (owner: 10Yuvipanda) [16:26:15] (03CR) 10Andrew Bogott: [C: 032] icinga: Move user / group setup into module [puppet] - 10https://gerrit.wikimedia.org/r/162867 (owner: 10Yuvipanda) [16:27:29] (03PS4) 10Ori.livneh: HHVM: update JIT settings [puppet] - 10https://gerrit.wikimedia.org/r/162839 [16:27:30] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: Epic puppet fail [16:27:36] (03CR) 10Ori.livneh: [C: 032 V: 032] HHVM: update JIT settings [puppet] - 10https://gerrit.wikimedia.org/r/162839 (owner: 10Ori.livneh) [16:27:57] mutante: hmm, I checked out apache/init.pp, and don't see php5 included by default... 
[16:28:12] otoh, I don't even know if icinga needs php5 [16:28:42] mutante: indeed, it the webserver role includes php5 and ssl [16:31:31] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Epic puppet fail [16:33:30] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:36:16] (03PS1) 10Reedy: Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162911 [16:36:18] (03PS1) 10Reedy: testwiki to 1.25wmf1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162912 [16:36:20] (03PS1) 10Reedy: Wikipedias to 1.24wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162913 [16:36:22] (03PS1) 10Reedy: group0 to 1.25wmf1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162914 [16:36:27] TIL my irc client turns "/r/" into a link to reddit [16:36:32] (03CR) 10Reedy: [C: 032] Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162911 (owner: 10Reedy) [16:36:36] (03CR) 10Andrew Bogott: [C: 032] icinga: Move NSCA code into module [puppet] - 10https://gerrit.wikimedia.org/r/162870 (owner: 10Yuvipanda) [16:36:38] (03Merged) 10jenkins-bot: Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162911 (owner: 10Reedy) [16:36:41] (03CR) 10Reedy: [C: 032] testwiki to 1.25wmf1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162912 (owner: 10Reedy) [16:37:09] * godog CTCP VERSION bd808 [16:37:21] Textual [16:37:22] (03Merged) 10jenkins-bot: testwiki to 1.25wmf1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162912 (owner: 10Reedy) [16:37:23] RedditIRC [16:37:48] Reedy: you're adding symlinks and making ori sad :) [16:38:17] bd808: oh ok, also image inlining, that can go wrong in so many ways in some channels [16:38:34] * YuviPanda also likes image inlining, and yeah, can go wrong.... 
[16:38:42] !log reedy Purged l10n cache for 1.24wmf20 [16:38:47] godog: I'm running an 8 month old fork of https://github.com/Codeux/Textual that I can't make public because I mixed GPLv2 code into their BSD codebase :( [16:39:08] Also I have image inlining disabled because yuck [16:40:24] (03PS1) 10Reedy: Why is 1.24wmf9 still around? [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162916 [16:40:26] I thought we fixed that script to make the symlinks relative. Apparently not. [16:40:38] (03CR) 10Reedy: [C: 032] Why is 1.24wmf9 still around? [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162916 (owner: 10Reedy) [16:40:43] (03Merged) 10jenkins-bot: Why is 1.24wmf9 still around? [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162916 (owner: 10Reedy) [16:41:51] !log Purged php-1.24wmf9 [16:42:32] So php-1.24wmf15 is the oldest we have.. [16:42:36] * Reedy wonders what can be deleted [16:43:20] 15-17 can go? [16:43:31] "Any branch checkout on the deployment server that has not been used for more than 5 weeks can be safely removed to reduce disk usage across the cluster." -- haven't done the math [16:43:51] 21 is going dead today [16:43:54] we should automate this [16:43:59] (to state the obvious) [16:44:02] PROBLEM - NTP peers on acamar is CRITICAL: NTP CRITICAL: No response from NTP server [16:44:05] so 5 weeks ago 16 went dead [16:44:17] * Reedy deletes 15 and 16 [16:44:24] What rolled off on 2014-08-21? That's 5 weeks ago today [16:44:27] bd808: ouch (re: bsd + gpl) [16:45:15] on 21st wmf18 was new [16:45:43] godog: Yeah. I tired to get a waiver from the author of the gpl code and found out he hated the Textual guys for branching at the commit before his license change. It was quite an email drama. 
[16:46:03] (03CR) 10Andrew Bogott: [C: 032] icinga: Remove analytics.cfg according to TEMP: message [puppet] - 10https://gerrit.wikimedia.org/r/162872 (owner: 10Yuvipanda) [16:46:08] *tried [16:46:13] (03CR) 10Andrew Bogott: [C: 032] icinga: Move wikidata monitoring into module [puppet] - 10https://gerrit.wikimedia.org/r/162881 (owner: 10Yuvipanda) [16:46:22] So yeah, 15 and 16 [16:46:33] Reedy: wmf17 would be on the cusp, yeah [16:46:45] I would personally drop it on tuesday [16:46:47] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:47:10] and +1 for ori's suggestion of figuring out how to automate this [16:47:26] (03PS1) 10Reedy: Remove 1.25wmf15 and 1.24wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162917 [16:47:32] find . -mtime +35 -exec rm {} \; ? [16:47:36] PROBLEM - NTP peers on achernar is CRITICAL: NTP CRITICAL: No response from NTP server [16:48:01] and a case of beer to anyone who figures out how to separate assets from the rest of the code so we only have 2 php branches and then the old assets until varnish cache expires [16:48:17] We don't always do a new version every week either (rare, but whatever) [16:48:26] PROBLEM - NTP peers on hydrogen is CRITICAL: NTP CRITICAL: No response from NTP server [16:48:38] https://gerrit.wikimedia.org/r/#/c/118337/ [16:48:41] and a keg of beer to anyone who figures out how to keep wmfX links out of varnish in the first place. [16:48:56] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: No response from NTP server [16:49:16] PROBLEM - NTP peers on chromium is CRITICAL: NTP CRITICAL: No response from NTP server [16:49:16] (03CR) 10Reedy: [C: 032] Remove 1.25wmf15 and 1.24wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162917 (owner: 10Reedy) [16:49:21] (03Merged) 10jenkins-bot: Remove 1.25wmf15 and 1.24wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162917 (owner: 10Reedy) [16:49:26] greg-g: Yeh. 
necessary but not sufficient [16:49:32] * greg-g nods [16:49:43] I was looking for a bug to track it, and found that instead [16:49:47] It can clean something up but you have to tell it what. And there is some manual crap after [16:50:17] https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Remove_clones_of_expired_branches [16:50:32] I guess we could script it a little more, possibly even making it more interactive [16:50:42] Do you want to delete php-1.XXwmfYY? (Y/N) [16:51:12] YuviPanda: I'm going to merge one more patch, then I need to go for a bit. There's a whole other patchset after this one, right? [16:51:37] Reedy: Do you want to delete php-1.XXwmfYY? ([d]etailed/[C]oncise report,[y]es,[n]o,[r]etry): [16:51:42] I thought about adding a data file (json or something) on tin that tracked when each branch was added/removed from wikiversions.json [16:51:43] andrewbogott: yeah, but they haven't been written yet. If you merge this set, that's it for now :) [16:51:55] ori: trebuchet much? [16:52:11] YuviPanda: oh, great! I thought I would never catch up [16:52:16] (03PS1) 10Yuvipanda: nagios_common: Move contacts managing into module [puppet] - 10https://gerrit.wikimedia.org/r/162918 [16:52:20] !log reedy Started scap: testwiki to 1.25wmf1 and build l10n cache [16:52:20] I guess git bisect could actually be used for that somehow? [16:52:22] andrewbogott: haha :D I just added one more, and I swear that's it for this time :) [16:52:40] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move check_paging into module [puppet] - 10https://gerrit.wikimedia.org/r/162882 (owner: 10Yuvipanda) [16:52:46] 1.25! 
Now I need to bump the @since tags again in my logging patches :( [16:53:20] (03PS2) 10Andrew Bogott: nagios_common: Move contacts managing into module [puppet] - 10https://gerrit.wikimedia.org/r/162918 (owner: 10Yuvipanda) [16:53:35] ori: now you're just trolling (more than normal) :) [16:54:20] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move contacts managing into module [puppet] - 10https://gerrit.wikimedia.org/r/162918 (owner: 10Yuvipanda) [16:54:29] (03CR) 10Dzahn: "i wouldn't even move this. instead delete it." [puppet] - 10https://gerrit.wikimedia.org/r/162882 (owner: 10Yuvipanda) [16:54:54] (03CR) 10Yuvipanda: "Ah, does nobody use this?" [puppet] - 10https://gerrit.wikimedia.org/r/162882 (owner: 10Yuvipanda) [16:55:10] andrewbogott: do run icinga -v before you go? [16:55:17] yep [16:55:28] I've been after every patch or two -- everything's clean so far [16:56:04] * greg-g plagarized bd808 without attribution on https://bugzilla.wikimedia.org/show_bug.cgi?id=71313 [16:56:44] (03PS1) 10Yuvipanda: shinken: Specify config_dir for contacts [puppet] - 10https://gerrit.wikimedia.org/r/162919 [16:56:46] andrewbogott: ah cool [16:56:48] (03PS1) 10Ori.livneh: Graphite: set Access-Control-Allow-Credentials for Grafana [puppet] - 10https://gerrit.wikimedia.org/r/162920 [16:56:50] (03CR) 10jenkins-bot: [V: 04-1] shinken: Specify config_dir for contacts [puppet] - 10https://gerrit.wikimedia.org/r/162919 (owner: 10Yuvipanda) [16:57:14] (03PS2) 10Yuvipanda: shinken: Specify config_dir for contacts [puppet] - 10https://gerrit.wikimedia.org/r/162919 [16:57:14] The l10n caches are the biggest disk hog. They can be dropped safely on each Tuesday for any inactive branches. [16:57:20] andrewbogott: ^ as well? only affects shinken [16:57:36] That should really be automated [16:57:51] you could iterate through wikiversions.cdb to figure out which branches are unusued [16:58:01] that way if we're off the regular schedule for some reason the scripts don't break [16:58:09] yeah. 
It would be pretty simple I think. [16:58:10] (03CR) 10Dzahn: "i thought it was the check that checked the old USB device that we once used for paging.. but apparently it's not, and just a dummy check " [puppet] - 10https://gerrit.wikimedia.org/r/162882 (owner: 10Yuvipanda) [16:58:38] (03CR) 10Yuvipanda: ":D cool!" [puppet] - 10https://gerrit.wikimedia.org/r/162882 (owner: 10Yuvipanda) [16:59:03] bd808: you could even decide to retain $OLDEST_DEPLOYED_BRANCH-1 for reverts [16:59:39] yes, but by Tuesday the chance of reverting the 'pedias is pretty low [16:59:48] nod [17:00:33] I wouldn't want to drop the N-2 cache on Thursday though [17:01:01] andrewbogott: HUGE THANKS! \o/ I shall buy you beverae of choice when we meet [17:01:10] YuviPanda: nice work! [17:01:45] ori: I think that's 50 patches merged in 2 days [17:04:26] (03PS1) 10Ottomata: Add cdh::hadoop::mount class to mount HDFS via fuse [puppet/cdh] - 10https://gerrit.wikimedia.org/r/162921 [17:06:23] (03CR) 10Filippo Giunchedi: [C: 031] Graphite: set Access-Control-Allow-Credentials for Grafana [puppet] - 10https://gerrit.wikimedia.org/r/162920 (owner: 10Ori.livneh) [17:08:07] godog: alright if i merge? i can babysit [17:09:14] ori: yes please! I'm fairly spent [17:09:35] thanks! [17:10:54] (03PS2) 10Ori.livneh: Graphite: set Access-Control-Allow-Credentials for Grafana [puppet] - 10https://gerrit.wikimedia.org/r/162920 [17:11:38] (03CR) 10Ori.livneh: [C: 032] Graphite: set Access-Control-Allow-Credentials for Grafana [puppet] - 10https://gerrit.wikimedia.org/r/162920 (owner: 10Ori.livneh) [17:11:53] mw deployments noob, when should I be expecting 1.25wmf1 to hit production? 
(asking because of https://gerrit.wikimedia.org/r/#/c/157157/) [17:17:38] (03PS1) 10BBlack: ntp: use explicit S2 upstreams [puppet] - 10https://gerrit.wikimedia.org/r/162922 [17:18:30] (03PS4) 10Filippo Giunchedi: swift: refactor into module, add codfw [puppet] - 10https://gerrit.wikimedia.org/r/162291 [17:19:22] (03CR) 10BBlack: [C: 032] ntp: use explicit S2 upstreams [puppet] - 10https://gerrit.wikimedia.org/r/162922 (owner: 10BBlack) [17:20:04] godog: why not switch to an existing module? [17:20:28] there was a good one around [17:20:41] needed some changes for us, but was overall in a better state than ours iirc [17:20:49] swift puppet module that is [17:20:56] !log reedy Finished scap: testwiki to 1.25wmf1 and build l10n cache (duration: 28m 36s) [17:21:07] +1 for reusing modules [17:22:35] That was "quick" [17:22:46] godog: It's on testwiki now [17:22:57] paravoid: the one from stackforge right? I remember looking at it but decided it was trying too much stuff at once [17:22:59] godog: It'll be on all wikipedias next thursday [17:23:11] godog: commons on tuesday [17:23:15] paravoid: e.g. https://github.com/stackforge/puppet-swift/blob/master/manifests/init.pp#L33 [17:23:40] (03PS2) 10Ottomata: Add cdh::hadoop::mount class to mount HDFS via fuse [puppet/cdh] - 10https://gerrit.wikimedia.org/r/162921 [17:23:54] argh [17:24:00] that wasn't there last time I was looking at it [17:24:08] RECOVERY - NTP peers on acamar is OK: NTP OK: Offset -0.003914 secs [17:24:24] or was it a different module [17:24:28] I forget... [17:24:37] * YuviPanda should replace our git module with something sane sometime [17:24:41] Reedy: cool, thanks! I found the deployment calendar meanwhile :) does the deployments bot support querying for a specific version? 
[17:25:19] RECOVERY - NTP peers on chromium is OK: NTP OK: Offset 0.002008 secs [17:25:20] paravoid: heh eventually I resorted to mending ours and try out hiera [17:25:29] RECOVERY - NTP peers on hydrogen is OK: NTP OK: Offset 0.005284 secs [17:25:35] godog: what do you mean? [17:25:39] RECOVERY - NTP peers on achernar is OK: NTP OK: Offset -0.001046 secs [17:25:59] RECOVERY - NTP peers on nescio is OK: NTP OK: Offset -0.001895 secs [17:26:09] Reedy: the bot will tell the channel it is time to deploy, can I ask it the other way around? "tell me when this version is going to be deployed" [17:26:12] (03PS1) 10Yuvipanda: icinga: Remove resource.cfg from refresh list [puppet] - 10https://gerrit.wikimedia.org/r/162924 [17:26:14] (03PS1) 10Yuvipanda: icinga: Move packages into module [puppet] - 10https://gerrit.wikimedia.org/r/162925 [17:26:16] (03CR) 10jenkins-bot: [V: 04-1] icinga: Remove resource.cfg from refresh list [puppet] - 10https://gerrit.wikimedia.org/r/162924 (owner: 10Yuvipanda) [17:26:18] godog: ah, I don't think so [17:26:19] godog: I think it was this one [17:26:20] (03CR) 10jenkins-bot: [V: 04-1] icinga: Move packages into module [puppet] - 10https://gerrit.wikimedia.org/r/162925 (owner: 10Yuvipanda) [17:26:26] YuviPanda: denied [17:26:27] YuviPanda: denied [17:26:32] it has some interesting parts, maybe frankenstein it [17:26:45] (03PS2) 10Yuvipanda: icinga: Remove resource.cfg from refresh list [puppet] - 10https://gerrit.wikimedia.org/r/162924 [17:26:50] (03PS2) 10Yuvipanda: icinga: Move packages into module [puppet] - 10https://gerrit.wikimedia.org/r/162925 [17:26:56] Reedy: heh [17:27:29] honestly I'd prefer moving into it and fork it/submit a bunch of patches for our use cases, but ymmv [17:28:19] paravoid: yeah that's true some bits are interesting, what I wanted to do was get codfw off the ground asap tbh but yes I agree that's a better way to go [17:29:35] so re: codfw are you going with a separate cluster after all? 
[17:29:47] I saw the mail but I didn't have the time to meaningfully contribute to that discussion :( [17:31:30] paravoid: yeah, even though it'd be nice to try out multiple regions it might introduce more problems than gains [17:31:48] mostly based on the pragmatic choice that mw can't run active/active anyway [17:32:34] (03PS1) 10BBlack: fix one of the EU ntp s2 [puppet] - 10https://gerrit.wikimedia.org/r/162926 [17:32:46] (03CR) 10BBlack: [C: 032 V: 032] fix one of the EU ntp s2 [puppet] - 10https://gerrit.wikimedia.org/r/162926 (owner: 10BBlack) [17:34:23] (03CR) 10Gage: [C: 031] Add cdh::hadoop::mount class to mount HDFS via fuse [puppet/cdh] - 10https://gerrit.wikimedia.org/r/162921 (owner: 10Ottomata) [17:35:59] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [17:37:05] (03CR) 10Ottomata: [C: 032 V: 032] Add cdh::hadoop::mount class to mount HDFS via fuse [puppet/cdh] - 10https://gerrit.wikimedia.org/r/162921 (owner: 10Ottomata) [17:37:59] RECOVERY - NTP peers on nescio is OK: NTP OK: Offset -0.010019 secs [17:38:56] (03PS1) 10BBlack: ntp: another eu s2 list improvement [puppet] - 10https://gerrit.wikimedia.org/r/162927 [17:39:07] (03CR) 10BBlack: [C: 032 V: 032] ntp: another eu s2 list improvement [puppet] - 10https://gerrit.wikimedia.org/r/162927 (owner: 10BBlack) [17:39:23] mhmm [17:39:27] we've a custom init script for icinga [17:39:30] dunno if that is still required [17:40:17] (03PS1) 10RobH: setting mgmt ip addresses for codfw mw servers [dns] - 10https://gerrit.wikimedia.org/r/162928 [17:40:31] (03CR) 10jenkins-bot: [V: 04-1] setting mgmt ip addresses for codfw mw servers [dns] - 10https://gerrit.wikimedia.org/r/162928 (owner: 10RobH) [17:40:51] bah, i left off the trailing . on every single one...
[17:41:40] (03PS2) 10RobH: setting mgmt ip addresses for codfw mw servers [dns] - 10https://gerrit.wikimedia.org/r/162928 [17:42:51] (03CR) 10RobH: [C: 032] setting mgmt ip addresses for codfw mw servers [dns] - 10https://gerrit.wikimedia.org/r/162928 (owner: 10RobH) [17:43:09] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [17:44:09] RECOVERY - NTP peers on nescio is OK: NTP OK: Offset 0.003148 secs [17:45:13] (03CR) 10Dzahn: These changes add the "extension" Sprint. The implementation is actually as a libphutil library. It can be enabled with the setting "load- (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/162873 (owner: 10Christopher Johnson (WMDE)) [17:45:58] (03CR) 10Dzahn: These changes add the "extension" Sprint. The implementation is actually as a libphutil library. It can be enabled with the setting "load- (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/162873 (owner: 10Christopher Johnson (WMDE)) [17:47:17] (03CR) 10Dzahn: [C: 031] phabricator - enable HSTS with max-age 7 days [puppet] - 10https://gerrit.wikimedia.org/r/162805 (https://bugzilla.wikimedia.org/38516) (owner: 10Chmarkine) [17:48:45] noooo, another 134 scap targets!? 
[17:48:54] :) [17:49:24] (03CR) 10Dzahn: "for appservers, wouldn't you also have too look into hhvm logfile format being the same and consume that (at some point)" [puppet] - 10https://gerrit.wikimedia.org/r/162541 (owner: 10Jeremyb) [17:49:58] (03PS1) 10Ottomata: Mount HDFS at /mnt/hdfs read only on role::analytics::clients (stat1002 and analytics1027) [puppet] - 10https://gerrit.wikimedia.org/r/162930 [17:50:28] hmm, there's a bunch of code in icinga.pp that I don't know is still needed, and I can't check because I don't have neon access :( [17:50:56] (03CR) 10Dzahn: "last time i tried this people told me "but we're not going to use dsh anyways, replace it all with salt"....i think i even abandoned somet" [puppet] - 10https://gerrit.wikimedia.org/r/162570 (owner: 10Yuvipanda) [17:51:30] (03CR) 10Yuvipanda: "heh, we still use dsh for a lot of things (Scap, for one), so I don't think it is going away anytime soon..." [puppet] - 10https://gerrit.wikimedia.org/r/162570 (owner: 10Yuvipanda) [17:52:01] (03CR) 10Dzahn: "the "it's not worth it" argument.. i dunno.. in the moment you say that somebody already did it.. kind of" [puppet] - 10https://gerrit.wikimedia.org/r/162570 (owner: 10Yuvipanda) [17:52:21] mutante: want to merge ^? :) [17:52:44] (03CR) 10Dzahn: "lol @ dsh not going away soon.. everytime we touch it somebody says it will go soon :)" [puppet] - 10https://gerrit.wikimedia.org/r/162570 (owner: 10Yuvipanda) [17:53:20] Do deployment.wikimedia.beta.wmflabs.org route through polonium ? [17:54:20] err [17:54:24] (03CR) 10Gage: [C: 031] Mount HDFS at /mnt/hdfs read only on role::analytics::clients (stat1002 and analytics1027) [puppet] - 10https://gerrit.wikimedia.org/r/162930 (owner: 10Ottomata) [17:54:56] tonythomas: no [17:55:04] tonythomas: wmflabs is kept entirely separate from production [17:55:14] (03CR) 10Yuvipanda: "ah, optimists. They're always wrong until they aren't." 
[puppet] - 10https://gerrit.wikimedia.org/r/162570 (owner: 10Yuvipanda) [17:55:22] YuviPanda: ok. so it route through labs on mx configuration ? [17:55:28] tonythomas: I think so [17:55:39] * YuviPanda is unsure about how mail works, in general, so is the worst person to ask [17:56:14] (03CR) 10Dzahn: "Carolynne from Zero said she thinks it can go but would like Dan Foy to confirm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162505 (owner: 10Dzahn) [17:56:16] Jeff_Green: Looks like we have a separate mx for labs. That simplifies a lot of things [17:56:40] YuviPanda: ya. We saw realm switches a lot. So thought where we would find all those configs. [17:56:42] orly [17:56:49] tonythomas: ah [17:57:06] I've no idea what 'separate mx' means anyway, and Jeff_Green probably knows much better than me :) [17:57:35] you would think, but I don't know the labs mail setup [17:57:52] I don't think labs has a specific mail setup, and I've always been surprised when outgoing mail from labs just works [17:57:53] tonythomas: send me a message from your labs instance and I'll look at how it was routed [17:58:06] i thought we routed it through the normal mx's [17:58:11] Jeff_Green: yeah ! in a min [17:58:24] (03PS1) 10BBlack: Switch all clients to new NTP servers [puppet] - 10https://gerrit.wikimedia.org/r/162931 [17:58:26] (03PS1) 10BBlack: switch debian installer to new NTP servers [puppet] - 10https://gerrit.wikimedia.org/r/162932 [17:58:39] heh [17:58:42] thats a lot of files [17:58:49] (03Restored) 10Dzahn: move dsh to module [puppet] - 10https://gerrit.wikimedia.org/r/96413 (owner: 10Dzahn) [17:59:12] (03CR) 10Dzahn: "also see matanya on - upload date 2013 btw" [puppet] - 10https://gerrit.wikimedia.org/r/162570 (owner: 10Yuvipanda) [17:59:15] bblack: so we're ditching the ntp.eqiad format? 
[17:59:41] Jeff_Green: https://dpaste.de/obwv#L20 [17:59:46] (03CR) 10Dzahn: "matanya, also see https://gerrit.wikimedia.org/r/#/c/162570/" [puppet] - 10https://gerrit.wikimedia.org/r/96413 (owner: 10Dzahn) [17:59:48] looks like it got into polonium somewhere [17:59:49] robh: I don't see much value in it, personally. It was only for the installer, and it's a CNAME to a single host, which still has to be updated when the NTP config in puppet is updated [18:00:04] the way I see it, this keeps it all in puppet instead of having to fix it in two places [18:00:04] Reedy, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140925T1800). Please do the needful. [18:00:16] yea but the cname is the only change in event of outage of the server [18:00:26] mutante: cool! it needs rebase tho [18:00:26] (for installs) [18:00:34] tonythomas: indeed it does, it appears to have gone straight from deployment-mediawiki02.eqiad.wmflabs to polonium [18:00:40] yea [18:00:42] bblack: robh https://gerrit.wikimedia.org/r/#/c/162496/ [18:00:44] its indeed only installs, so yea [18:00:59] robh: notably the time cname for esams doesn't even exist and nobody's complained. who knows how long... [18:01:00] Jeff_Green: interesting again [18:01:08] lemme try something here... 
[18:01:15] okey [18:01:48] rcpt to: jgreen@deployment-mediawiki02.eqiad.wmflabs [18:01:49] 550 Relay not permitted [18:01:58] and no return path by that hostname anyway [18:02:08] if you guys prefer, I can leave it in DNS (but I'll probably still move them to .wm.o since they're not in 10/8) [18:02:24] so we're where I thought, you need an in-labs mail bouncer host to test with [18:02:52] (03PS3) 10BBlack: Move partial traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162749 [18:02:59] (03CR) 10BBlack: [C: 032 V: 032] Move partial traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162749 (owner: 10BBlack) [18:03:23] Jeff_Green: and with the host name deployment-mediawiki02.eqiad.wmflabs ? [18:03:36] no [18:04:19] (03CR) 10Dzahn: "bblack: should this be abandoned too ?" [puppet] - 10https://gerrit.wikimedia.org/r/162175 (owner: 10Dzahn) [18:04:33] (03CR) 10Dzahn: "this as well? abandon?" [dns] - 10https://gerrit.wikimedia.org/r/162496 (owner: 10Dzahn) [18:04:51] Jeff_Green: then how can we expect the bounce to come back all the way till our deployment-mediawiki02 ? [18:04:52] I'm not super familiar with the deployment project, but I think the easiest thing to do would be to hijack outbound mail deployment-mediawiki02.eqiad.wmflabs and route it to a test host where you can mess with the response, as we did in earlier testing [18:04:57] you can't [18:05:01] mutante: yeah probably. 
I saw all of those, but I decided to Be Bold and just go redo everything differently :) [18:05:23] (03Abandoned) 10Dzahn: NTP client config - use rubidium/eeden as servers [puppet] - 10https://gerrit.wikimedia.org/r/162175 (owner: 10Dzahn) [18:05:28] tonythomas: afaik you can not route mail in to labs from the outside world, it's that simple [18:05:32] (03Abandoned) 10Dzahn: NTP service aliases, switch eqiad, add esams [dns] - 10https://gerrit.wikimedia.org/r/162496 (owner: 10Dzahn) [18:06:02] there's a related one from alex in there somewhere too [18:06:19] Jeff_Green: ok. so we should have the beta use our test mail server, right? instead of polonium? [18:06:29] (03PS2) 10Reedy: Wikipedias to 1.24wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162913 [18:06:34] (03CR) 10Reedy: [C: 032] Wikipedias to 1.24wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162913 (owner: 10Reedy) [18:06:39] (03Merged) 10jenkins-bot: Wikipedias to 1.24wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162913 (owner: 10Reedy) [18:07:14] (03Abandoned) 10BBlack: Setup EQIAD NTP servers [puppet] - 10https://gerrit.wikimedia.org/r/161984 (owner: 10Alexandros Kosiaris) [18:07:25] tonythomas: i think that's simplest yes [18:08:09] Jeff_Green: looks like more exim changes. Let me setup the instance.
[18:08:17] (03PS3) 10Yuvipanda: icinga: Move packages into module [puppet] - 10https://gerrit.wikimedia.org/r/162925 [18:08:19] (03PS3) 10Yuvipanda: icinga: Remove resource.cfg from refresh list [puppet] - 10https://gerrit.wikimedia.org/r/162924 [18:08:20] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.24wmf22 [18:08:21] (03PS3) 10Yuvipanda: shinken: Specify config_dir for contacts [puppet] - 10https://gerrit.wikimedia.org/r/162919 [18:08:23] (03PS1) 10Yuvipanda: icinga: Move naggen into module [puppet] - 10https://gerrit.wikimedia.org/r/162936 [18:08:33] or do we use our existing verpverpverp ( our current mx that routes mediawiki-verp instance ) [18:08:42] (03CR) 10BBlack: [C: 032] Switch all clients to new NTP servers [puppet] - 10https://gerrit.wikimedia.org/r/162931 (owner: 10BBlack) [18:09:56] tonythomas: another possibility would be to use an iptables rule to redirect outbound traffic to destination-port 25 to some labs host [18:10:06] * YuviPanda is fairly proud of the verpverpverp hostname [18:10:14] (03CR) 10Dzahn: "no ops reviews - will abandon" [puppet] - 10https://gerrit.wikimedia.org/r/153986 (owner: 10Dzahn) [18:10:16] (03PS3) 10BBlack: Move partial traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162750 [18:10:50] (03CR) 10BBlack: [C: 032] Move partial traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162750 (owner: 10BBlack) [18:11:09] PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Epic puppet fail [18:11:28] PROBLEM - puppet last run on rdb1001 is CRITICAL: CRITICAL: Epic puppet fail [18:11:28] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Epic puppet fail [18:11:29] PROBLEM - puppet last run on db72 is CRITICAL: CRITICAL: Epic puppet fail [18:11:51] heh [18:11:54] I hope that's not me! [18:12:25] Jeff_Green: some labs host ? we will need to have the mx config running on that one right ? 
[18:12:29] YuviPanda: ha :) true [18:12:33] we really should remove 'EPIC' from that notice [18:12:37] bugs me every time [18:12:40] haha [18:12:57] its all like "oh , you have a single syntax error in an erb template for some inconsequential thing...MUST BE EPIC" [18:13:30] tonythomas: yeah, you'd use the mta that host to mess with responses [18:13:32] heh it was me, but, apparently the puppet updates are not transactional [18:13:48] so a few clients hit the old manifests but the newly-missing template [18:13:52] :p [18:14:13] they'll fix themselves next run [18:14:28] RECOVERY - puppet last run on rdb1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [18:14:41] Jeff_Green: ok. now trying to spawn a fresh mx [18:14:42] (^ I did that one to verify) [18:15:08] (03PS1) 10Ottomata: Remove 'epic' from the notice message for check_puppetrun [puppet] - 10https://gerrit.wikimedia.org/r/162937 [18:16:00] ottomata: gah, why is that in the 'base' module rather than somewhere else where we keep our other custom checks? [18:16:01] sigh [18:16:26] (03PS2) 10Reedy: group0 to 1.25wmf1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162914 [18:16:32] (03CR) 10Gage: [C: 032] "OMG LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/162937 (owner: 10Ottomata) [18:16:40] dunno! 
[18:17:24] (03Abandoned) 10Dzahn: add README.md to all modules [puppet] - 10https://gerrit.wikimedia.org/r/161634 (owner: 10Dzahn) [18:17:34] merged your puppet message change [18:17:37] (03CR) 10Reedy: [C: 032] group0 to 1.25wmf1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162914 (owner: 10Reedy) [18:17:39] and mischanned [18:17:45] (03Merged) 10jenkins-bot: group0 to 1.25wmf1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162914 (owner: 10Reedy) [18:17:56] "git grep" needs a "git replace" companion [18:18:11] * tonythomas wants to try !log to log the new instance made, and thinks whether that would work [18:18:14] oh there's already a "git replace" that does something else [18:18:17] "git sed" ? [18:18:28] is that like simon says? [18:19:29] i had not used git sed [18:19:40] though i did just use sed in my mw dns changes for mgmt [18:20:00] bblack: bah, that's not a real command [18:20:09] ;p [18:20:17] haha [18:20:23] (03PS1) 10Aude: Bump $wgCacheEpoch for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162938 [18:20:37] there is nothing that cannot be accomplished with the proper application of sed and/or awk [18:20:44] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.25wmf1 [18:21:01] aude: I should've known that was coming ;) [18:21:04] hah [18:22:27] (03PS2) 10Reedy: Bump $wgCacheEpoch for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162938 (owner: 10Aude) [18:22:31] omg, Cannot use object of type stdClass as array [18:22:32] (03CR) 10Reedy: [C: 032] Bump $wgCacheEpoch for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162938 (owner: 10Aude) [18:22:35] SpecialPageFactory [18:22:40] Again? [18:22:41] WUUUUT [18:22:45] rage [18:22:50] * aude looks [18:22:52] where?
[18:22:59] fluorine [18:23:04] (03Merged) 10jenkins-bot: Bump $wgCacheEpoch for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162938 (owner: 10Aude) [18:23:39] mobile [18:23:59] http://www.mediawiki.org/w/index.php?title=Special:Search&search=kitten&fulltext=Search&mobileaction=toggle_view_desktop [18:24:28] PROBLEM - NTP on db1071 is CRITICAL: NTP CRITICAL: No response from NTP server [18:25:08] PROBLEM - NTP on snapshot1004 is CRITICAL: NTP CRITICAL: No response from NTP server [18:25:09] Fatal error: Call to a member function getFullURL() on a non-object [18:25:26] Yeah, I just pasted that in -mobile, MaxSem is looking at it [18:25:49] PROBLEM - NTP on db1064 is CRITICAL: NTP CRITICAL: No response from NTP server [18:26:09] k [18:26:12] I wish we had a role::mail::mx here in wikitech 'Configure instance' page [18:26:18] i see two things, but probably related [18:26:19] PROBLEM - NTP on db1063 is CRITICAL: NTP CRITICAL: No response from NTP server [18:28:12] PROBLEM - NTP on virt1000 is CRITICAL: NTP CRITICAL: No response from NTP server [18:28:15] 8 Fatal error: Cannot use object of type stdClass as array in /srv/mediawiki/php-1.25wmf1/includes/specialpage/SpecialPageFactory.php on line 281 [18:28:20] aude: just appeared in the apache syslogs too :( [18:28:54] I see 281 as self::$aliases[$caseFoldedAlias] = $name; [18:29:03] i see [18:29:07] APC? [18:29:18] PROBLEM - NTP on copper is CRITICAL: NTP CRITICAL: No response from NTP server [18:29:24] host outta sync? 
[18:29:27] Noting I just reverted Daniels change from 1.24wmf22 [18:29:28] RECOVERY - puppet last run on ms-be1008 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [18:29:29] PROBLEM - NTP on bast1001 is CRITICAL: NTP CRITICAL: No response from NTP server [18:29:48] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [18:29:52] MaxSem: various hosts at least [18:29:58] RECOVERY - puppet last run on db72 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [18:30:02] at least 8 different ones [18:30:09] PROBLEM - NTP on virt1008 is CRITICAL: NTP CRITICAL: Offset unknown [18:30:34] heh [18:31:06] (03PS3) 10Reedy: Add OTRS-member group to fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162345 (https://bugzilla.wikimedia.org/54368) (owner: 10Reza) [18:31:11] (03CR) 10Reedy: [C: 032] Add OTRS-member group to fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162345 (https://bugzilla.wikimedia.org/54368) (owner: 10Reza) [18:31:15] (03Merged) 10jenkins-bot: Add OTRS-member group to fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162345 (https://bugzilla.wikimedia.org/54368) (owner: 10Reza) [18:31:40] PROBLEM - NTP on virt1005 is CRITICAL: NTP CRITICAL: Offset unknown [18:32:00] the virt ones are probably real, the others seems transient [18:32:05] (ntp issues) [18:32:08] PROBLEM - NTP on ms-be3004 is CRITICAL: NTP CRITICAL: No response from NTP server [18:33:08] PROBLEM - NTP on virt0 is CRITICAL: NTP CRITICAL: No response from NTP server [18:33:08] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). 
[18:33:18] RECOVERY - NTP on snapshot1004 is OK: NTP OK: Offset -0.003938317299 secs [18:33:36] (03PS3) 10Reedy: Flow enable mw:Talk:Mediawiki UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162486 (https://bugzilla.wikimedia.org/71204) (owner: 10EBernhardson) [18:33:38] RECOVERY - NTP on db1071 is OK: NTP OK: Offset -0.001082658768 secs [18:33:42] (03CR) 10Reedy: [C: 032] Flow enable mw:Talk:Mediawiki UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162486 (https://bugzilla.wikimedia.org/71204) (owner: 10EBernhardson) [18:33:47] (03Merged) 10jenkins-bot: Flow enable mw:Talk:Mediawiki UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162486 (https://bugzilla.wikimedia.org/71204) (owner: 10EBernhardson) [18:33:57] yurikR1: 'label:Free Wikipedia from MTN Rwanda' [18:34:16] what, again? :( [18:34:26] in some cases, ntp didn't restart, probably a low-probability failure in the initscript [18:34:30] greg-g: I was wondering if it would be possible to do an out of band config deployment. Basically just want to turn on WikiGrok on en.wiki prior to our quarterly planning meeting at 1 (but had to wait until en.wiki was on wmf22 which was just a few minutes ago). Is that kosher or verbotin? [18:34:49] PROBLEM - NTP on virt1009 is CRITICAL: NTP CRITICAL: Offset unknown [18:34:52] (03PS2) 10Reedy: Add delete right to fawiki Image-reviewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162565 (https://bugzilla.wikimedia.org/71229) (owner: 10Reza) [18:34:56] (03CR) 10Reedy: [C: 032] Add delete right to fawiki Image-reviewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162565 (https://bugzilla.wikimedia.org/71229) (owner: 10Reza) [18:35:01] (03Merged) 10jenkins-bot: Add delete right to fawiki Image-reviewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162565 (https://bugzilla.wikimedia.org/71229) (owner: 10Reza) [18:35:17] kaldari: I'm still deploying shit... 
[18:35:20] kaldari: I can just do it [18:35:32] Reedy: that would be awesome... [18:35:52] Reedy: the config change is https://gerrit.wikimedia.org/r/#/c/158512/ [18:35:54] (03PS3) 10Reedy: Enable WikiGrok for prototype testing on enwiki mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158512 (owner: 10Kaldari) [18:36:00] RECOVERY - NTP on db1064 is OK: NTP OK: Offset -0.008604764938 secs [18:36:08] PROBLEM - NTP on logstash1003 is CRITICAL: NTP CRITICAL: No response from NTP server [18:36:18] PROBLEM - NTP on vanadium is CRITICAL: NTP CRITICAL: No response from NTP server [18:36:19] RECOVERY - NTP on virt1000 is OK: NTP OK: Offset -0.004755735397 secs [18:36:19] RECOVERY - NTP on db1063 is OK: NTP OK: Offset -0.00736105442 secs [18:36:24] (03CR) 10Reedy: [C: 032] Enable WikiGrok for prototype testing on enwiki mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158512 (owner: 10Kaldari) [18:36:28] (03Merged) 10jenkins-bot: Enable WikiGrok for prototype testing on enwiki mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158512 (owner: 10Kaldari) [18:36:29] RECOVERY - NTP on bast1001 is OK: NTP OK: Offset -0.001751184464 secs [18:36:30] huh, i can reproduce now [18:36:45] * aude had mobile extension but not updated it for a month [18:37:00] Call to a member function getFullURL() [18:37:18] PROBLEM - NTP on es1008 is CRITICAL: NTP CRITICAL: No response from NTP server [18:37:21] PROBLEM - NTP on db73 is CRITICAL: NTP CRITICAL: No response from NTP server [18:37:28] (03PS3) 10Reedy: Add "viewdeletedfile" userright for global deleted image review [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162546 (https://bugzilla.wikimedia.org/14801) (owner: 10Legoktm) [18:37:45] aude: [19:35:36] Reedy, aha - that's some core change that broke MW:P [18:37:51] PROBLEM - NTP on db1065 is CRITICAL: NTP CRITICAL: No response from NTP server [18:38:04] was really meaning MF [18:38:04] oh noes [18:38:09] PROBLEM - NTP on analytics1040 is CRITICAL: 
NTP CRITICAL: No response from NTP server [18:38:19] RECOVERY - NTP on copper is OK: NTP OK: Offset 0.001903653145 secs [18:38:25] (03CR) 10Reedy: [C: 032] Add "viewdeletedfile" userright for global deleted image review [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162546 (https://bugzilla.wikimedia.org/14801) (owner: 10Legoktm) [18:38:30] (03Merged) 10jenkins-bot: Add "viewdeletedfile" userright for global deleted image review [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162546 (https://bugzilla.wikimedia.org/14801) (owner: 10Legoktm) [18:38:59] RECOVERY - NTP on ms-be3004 is OK: NTP OK: Offset -0.002804994583 secs [18:39:08] PROBLEM - NTP on ms-fe1001 is CRITICAL: NTP CRITICAL: No response from NTP server [18:39:19] PROBLEM - NTP on db1066 is CRITICAL: NTP CRITICAL: No response from NTP server [18:39:41] (03PS2) 10Reedy: Only use the RSS proxy on foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162532 (owner: 10Legoktm) [18:39:44] (03CR) 10Reedy: [C: 032] Only use the RSS proxy on foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162532 (owner: 10Legoktm) [18:39:49] (03Merged) 10jenkins-bot: Only use the RSS proxy on foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162532 (owner: 10Legoktm) [18:40:03] greg-g: where's the release at? We found a major bug in Flow master [18:40:16] (03PS3) 10Reedy: Don't allow granting a removed group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160372 (owner: 10Jackmcbarn) [18:40:19] Reedy: Just let me know when the config change is live so I can test. Thanks! [18:40:20] (03CR) 10Reedy: [C: 032] Don't allow granting a removed group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160372 (owner: 10Jackmcbarn) [18:40:25] (03Merged) 10jenkins-bot: Don't allow granting a removed group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160372 (owner: 10Jackmcbarn) [18:40:33] spagewmf: 1.25wmf1? 
It's on group0 [18:40:35] spagewmf: Reedy doing things now it looks like [18:40:40] PROBLEM - NTP on db1067 is CRITICAL: NTP CRITICAL: Offset unknown [18:40:40] PROBLEM - NTP on logstash1002 is CRITICAL: NTP CRITICAL: Offset unknown [18:40:42] I BLAME ^d ! :P [18:41:08] RECOVERY - NTP on virt0 is OK: NTP OK: Offset -0.003692388535 secs [18:41:18] PROBLEM - NTP on tin is CRITICAL: NTP CRITICAL: Offset unknown [18:41:18] PROBLEM - NTP on snapshot1001 is CRITICAL: NTP CRITICAL: Offset unknown [18:41:33] or not? [18:41:37] silly blame [18:41:47] PROBLEM - NTP on ruthenium is CRITICAL: NTP CRITICAL: Offset unknown [18:42:07] PROBLEM - NTP on gallium is CRITICAL: NTP CRITICAL: Offset unknown [18:42:10] PROBLEM - NTP on virt1004 is CRITICAL: NTP CRITICAL: Offset unknown [18:42:20] PROBLEM - NTP on virt1003 is CRITICAL: NTP CRITICAL: Offset unknown [18:42:49] (03CR) 10Reedy: [C: 04-1] "Extension should be added to extension-list-labs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162358 (https://bugzilla.wikimedia.org/71188) (owner: 10Gergő Tisza) [18:43:07] RECOVERY - NTP on analytics1040 is OK: NTP OK: Offset -0.003607153893 secs [18:43:07] PROBLEM - NTP on db1052 is CRITICAL: NTP CRITICAL: Offset unknown [18:43:13] (03PS2) 10Reedy: Add 'unwatchedpages' right to 'patroller' user group on he.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162568 (https://bugzilla.wikimedia.org/71193) (owner: 10Calak) [18:43:17] PROBLEM - NTP on sodium is CRITICAL: NTP CRITICAL: Offset unknown [18:43:17] (03CR) 10Reedy: [C: 032] Add 'unwatchedpages' right to 'patroller' user group on he.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162568 (https://bugzilla.wikimedia.org/71193) (owner: 10Calak) [18:43:18] RECOVERY - NTP on tin is OK: NTP OK: Offset -0.01655626297 secs [18:43:24] (03Merged) 10jenkins-bot: Add 'unwatchedpages' right to 'patroller' user group on he.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162568 
(https://bugzilla.wikimedia.org/71193) (owner: 10Calak) [18:43:27] RECOVERY - NTP on db73 is OK: NTP OK: Offset -0.004716277122 secs [18:43:27] RECOVERY - NTP on snapshot1001 is OK: NTP OK: Offset -0.006387591362 secs [18:43:32] afaik the bulk of the ntp daemon issues are fixed now (all cases of ntpd failing to start when puppet restarted it), but it's taking a while for icinga to catch up. [18:43:48] RECOVERY - NTP on db1065 is OK: NTP OK: Offset -0.002247571945 secs [18:43:48] RECOVERY - NTP on db1067 is OK: NTP OK: Offset -0.00986456871 secs [18:43:48] RECOVERY - NTP on logstash1002 is OK: NTP OK: Offset -0.003009080887 secs [18:43:48] RECOVERY - NTP on ruthenium is OK: NTP OK: Offset -0.007750034332 secs [18:44:07] RECOVERY - NTP on gallium is OK: NTP OK: Offset -0.003201127052 secs [18:44:07] RECOVERY - NTP on ms-fe1001 is OK: NTP OK: Offset -0.00598192215 secs [18:44:07] RECOVERY - NTP on es1008 is OK: NTP OK: Offset -0.004096031189 secs [18:44:07] RECOVERY - NTP on logstash1003 is OK: NTP OK: Offset -0.003072619438 secs [18:44:07] RECOVERY - NTP on db1052 is OK: NTP OK: Offset -0.004544854164 secs [18:44:17] RECOVERY - NTP on db1066 is OK: NTP OK: Offset -0.005947589874 secs [18:44:17] RECOVERY - NTP on vanadium is OK: NTP OK: Offset -0.01206386089 secs [18:44:55] PROBLEM - NTP on virt1001 is CRITICAL: NTP CRITICAL: Offset unknown [18:45:17] (03CR) 10Reedy: "Should it be so big?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162638 (https://bugzilla.wikimedia.org/70996) (owner: 10Glaisher) [18:45:26] RECOVERY - NTP on sodium is OK: NTP OK: Offset -0.01545155048 secs [18:45:26] PROBLEM - NTP on virt1007 is CRITICAL: NTP CRITICAL: Offset unknown [18:46:14] Jeff_Green: ok. 
our mx running at 10.68.16.222 [18:46:31] ok [18:46:59] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 14s) [18:47:03] kaldari: ^^ should be live [18:47:33] Reedy: thanks [18:47:35] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [18:47:40] (03PS2) 10Gergő Tisza: Deploy ImageMetrics extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162358 (https://bugzilla.wikimedia.org/71188) [18:48:07] tonythomas: woo, deployment-mediawiki02 -> verpmxhost tcp/25 is open :-P [18:48:29] Jeff_Green: that looks great. [18:48:53] and one more question - -why would the bounce come back necessarily to verpmxhost ? [18:48:55] tonythomas: so, I don't know the social impact of hijacking outbound mail from deployment-mediawiki02, i.e. who else is using it that will be affected [18:49:16] tonythomas: they'll never leave verpmxhost [18:49:34] chrismcmahonbrb: bd808|LUNCH greg-g ^ (re- outgoing email from betacluster) [18:49:42] Jeff_Green: it can route to external world though. [18:50:11] you'll send a message from mediawiki on deployment-mediawiki02, which will enter the local mail spool, and then it will either be smartrouted or iptables-hijacked and sent to verpmxhost as the next mail hop [18:50:27] * greg-g has no idea re mail use [18:50:40] then on verpmxhost you'll be able to mess with exim to produce whatever response you want [18:51:21] Jeff_Green: ok. and proper emails will reach out through our verpmxhost right ? [18:51:22] hrmm, actually .. .. [18:51:33] another problem [18:51:41] I saw that last time. so that any proper delivery wont get affected [18:51:46] garg, we are really not tooled well for mail testing [18:52:00] ? [18:52:12] if you're going to simulate production plus the outside world you really need three hosts [18:52:31] 3 ? [18:52:41] polonium can directly POST to the production API ? 
[18:52:49] webserver (deployment-mediawiki02) --> mx --> box to emulate the remote side of the mail transaction [18:53:31] Jeff_Green: emulate the remote side ? [18:54:01] yes, to behave like the recipient's ISP's mail server [18:54:16] to create a bounce right ? [18:54:33] right [18:54:55] (03PS3) 10Reedy: Deploy ImageMetrics extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162358 (https://bugzilla.wikimedia.org/71188) (owner: 10Gergő Tisza) [18:54:58] but that would mean -- we might disrupt the email-ability of beta right ? [18:55:04] (03PS4) 10Reedy: Deploy ImageMetrics extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162358 (https://bugzilla.wikimedia.org/71188) (owner: 10Gergő Tisza) [18:55:04] yep [18:55:10] (03CR) 10Reedy: [C: 032] Deploy ImageMetrics extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162358 (https://bugzilla.wikimedia.org/71188) (owner: 10Gergő Tisza) [18:55:14] (03Merged) 10jenkins-bot: Deploy ImageMetrics extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162358 (https://bugzilla.wikimedia.org/71188) (owner: 10Gergő Tisza) [18:55:34] tonythomas: i'm thinking this is better to test in the verp project if that's possible [18:55:47] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 14s) [18:55:59] I don't really want to make a mockery of the deployment project and block other testing [18:56:01] permanent errors like lookup problems ( like the one we had last time ) could be emulated with just verpmxhost trying to connect external google mx hosts [18:57:09] tonythomas: i.e. if verpmxhost is acting as our mx, yeah, that's true. 
and that would allow you to test our exim mx config on verpmxhost itself [18:57:27] (03PS1) 10BBlack: NTP aliases for installer: ntp.$site.wm.o [dns] - 10https://gerrit.wikimedia.org/r/162946 [18:58:07] (03PS2) 10BBlack: switch debian installer to new NTP alises [puppet] - 10https://gerrit.wikimedia.org/r/162932 [18:58:30] eh, logging from mw to logstash is broken. <-- ping bd808|LUNCH and ori [18:58:35] and Reedy ^ [18:58:41] (03CR) 10BBlack: [C: 032] NTP aliases for installer: ntp.$site.wm.o [dns] - 10https://gerrit.wikimedia.org/r/162946 (owner: 10BBlack) [18:58:46] orly? [18:59:02] yup, only non-mw crap in https://logstash.wikimedia.org/#/dashboard/elasticsearch/default [18:59:29] (03PS3) 10BBlack: switch debian installer to new NTP aliases [puppet] - 10https://gerrit.wikimedia.org/r/162932 [18:59:36] Not sure I quite know how to debug that [18:59:56] (03CR) 10BBlack: [C: 032 V: 032] switch debian installer to new NTP aliases [puppet] - 10https://gerrit.wikimedia.org/r/162932 (owner: 10BBlack) [19:00:29] Jeff_Green: ok. even to get that done - we will have to do the exim or iptables change [19:00:59] tonythomas: correct [19:01:35] another option I guess would be to create another host like deployment-mediawiki02 for testing [19:02:27] fluorine seems to still receive MW logs [19:03:20] Jeff_Green: I think we already have that one too [19:03:24] mediawiki-verp [19:03:37] yeah [19:04:23] MaxSem: What about these errors then? :P [19:04:30] pokingggg [19:05:09] Jeff_Green: running role::webserver with a full install of mediawiki too [19:05:24] tonythomas: yes [19:05:39] so.... [19:05:50] I really am not very familiar with our deploymnent testing [19:05:57] Reedy: can you deploy the CA change now or should I ask o.ri to do it? 
[19:06:01] I have never had reason to use it [19:06:10] legoktm: I can [19:06:22] Reedy: backports are https://gerrit.wikimedia.org/r/#/q/I7a22fa5825e2831d4c43f85094121ec56e7cf290,n,z [19:06:56] tonythomas: so I can tell you what I would do for an ideal testing environment, but I can't say "do XYZ in the deployment project" [19:08:29] Jeff_Green: ok. So we stick back to our verpverpverp and mediawiki-verp, maybe ? [19:08:43] and the third one that can do manipulations -- if needed [19:09:46] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [19:09:46] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [19:10:13] tonythomas: yes, I think that's a place you can tool to do meaningful testing. but I don't know whether that can satisfy our org's deployment testing requirements [19:10:16] does that make sense? [19:11:20] Jeff_Green: true. in that case, the best option I think can do would be to make deployement-wiki route through our host. [19:11:58] right now all I can say is that our deployment test environment does not adequately model production to do meaningful mailsystem testing [19:12:55] RECOVERY - NTP on virt1001 is OK: NTP OK: Offset -0.006863594055 secs [19:13:56] Jeff_Green: I wish we had a separate mx for beta. atleast one between polonium and beta [19:14:19] agreed. we need that if we're going to test the mailsystem [19:15:29] (03PS1) 10BBlack: temp fix for virt100X ntp [puppet] - 10https://gerrit.wikimedia.org/r/162954 [19:15:55] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [19:16:06] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[19:16:13] (03CR) 10jenkins-bot: [V: 04-1] temp fix for virt100X ntp [puppet] - 10https://gerrit.wikimedia.org/r/162954 (owner: 10BBlack) [19:16:27] YuviPanda: I don't understand the question about email from beta cluster... [19:16:57] chrismcmahon: I think tonythomas and Jeff_Green green might be playing with it atm, and thought you should know [19:17:11] no, nothing impacted [19:17:12] YuviPanda: ah, OK. no problem [19:17:32] chrismcmahon: you're asking about what tony and I are mulling? [19:17:39] or something else? [19:17:44] (03PS2) 10BBlack: temp fix for virt100X ntp [puppet] - 10https://gerrit.wikimedia.org/r/162954 [19:17:50] !log reedy Synchronized php-1.24wmf22/extensions/CentralAuth/: (no message) (duration: 00m 14s) [19:18:25] Reedy: Sorry it took so lnog, ran into a merge conflict. Here it is: https://gerrit.wikimedia.org/r/#/c/162955 [19:18:45] (03CR) 10BBlack: [C: 032] temp fix for virt100X ntp [puppet] - 10https://gerrit.wikimedia.org/r/162954 (owner: 10BBlack) [19:18:52] !log reedy Synchronized php-1.25wmf1/: (no message) (duration: 00m 55s) [19:20:26] git bisect tells me https://gerrit.wikimedia.org/r/#/c/161499/ is the issue [19:20:40] MaxSem: Reedy ^ [19:21:49] aude, I reverted seconds before your report:P [19:21:55] yay, thanks [19:22:31] !log ntp work done on hosts [19:24:28] !log reedy Synchronized php-1.24wmf22/resources/src/mediawiki.ui/components/buttons.less: (no message) (duration: 00m 14s) [19:25:37] heh [19:26:15] RECOVERY - NTP on virt1005 is OK: NTP OK: Offset -0.007097601891 secs [19:26:35] RECOVERY - NTP on virt1004 is OK: NTP OK: Offset -0.002235174179 secs [19:26:45] RECOVERY - NTP on virt1007 is OK: NTP OK: Offset -0.004180550575 secs [19:26:55] RECOVERY - NTP on virt1003 is OK: NTP OK: Offset -0.003189563751 secs [19:27:01] RECOVERY - NTP on virt1008 is OK: NTP OK: Offset -0.003887534142 secs [19:27:02] morebots: ? [19:27:02] I am a logbot running on tools-exec-11. 
[19:27:02] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [19:27:02] To log a message, type !log <msg>. [19:27:05] RECOVERY - NTP on virt1009 is OK: NTP OK: Offset -0.007123827934 secs [19:27:42] uh [19:27:47] legoktm: it's not been logging, has it? :/ [19:27:51] it's logging [19:27:57] just not acknowledging [19:28:31] SAL got vandalisez [19:28:32] d [19:28:47] moved [19:29:32] https://wikitech.wikimedia.org/wiki/Special:Log/move [19:29:33] heh [19:29:36] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [19:29:43] * aude wouldn't mind being admin on wikitech to handle these things in the future ;) [19:30:24] There's probably a handful of people that should had sysop tbh [19:30:35] !log reedy Synchronized php-1.25wmf1/: (no message) (duration: 00m 46s) [19:30:38] RECOVERY - NTP peers on nescio is OK: NTP OK: Offset 0.002018 secs [19:32:45] PROBLEM - NTP peers on acamar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [19:34:46] RECOVERY - NTP peers on acamar is OK: NTP OK: Offset -0.001627 secs [19:35:08] (03PS1) 10Reedy: wgMemoryLimit to 300MB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162959 [19:37:06] What's up with mw1051? [19:37:12] It looks pretty lazy according to ganglia [19:37:35] mw1163 too, but that has known memory problems [19:39:19] Actually, if someone wants to look at the kern.log on mw1163... [19:39:30] https://rt.wikimedia.org/Ticket/Display.html?id=8243 [19:43:18] Reedy: Thanks for the last minute deploy. You saved my demo in front of Erik and Lila in 20 minutes :) [19:43:41] (03PS1) 10Ottomata: Use $::instanceproject as Hadoop user group in labs [puppet] - 10https://gerrit.wikimedia.org/r/162961 [19:43:43] whee [19:47:07] MaxSem: did you get anyone to look at logstash? [19:47:18] nope [19:47:26] I'm on it then. [19:47:37] ori is commuting and I don't have access yet ;) [19:47:39] thanks:) [19:49:43] !log Restarted logstash on logstash1001.
udp2log events were not being recorded. [19:51:27] MaxSem, Reedy: looks like its working again now. This happens "occasionally" and the fix is almost always to do `sudo service logstash stop; sudo service logstash start` on logstash1001 [19:52:30] There is a bug open in bugzilla about it getting stuck. I need to find time to upgrade all the software in that stack to newer versions to see if that makes things better. [19:52:50] (03PS3) 10BBlack: Move partial traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162751 [19:52:57] (03CR) 10BBlack: [C: 032 V: 032] Move partial traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162751 (owner: 10BBlack) [19:53:04] And also get my logging patches merged so we can use a better shipping mechanism than udp2log packets [19:55:03] (03PS1) 10Ottomata: Use nasty exec hack to avoid mount -> file dependency issues [puppet/cdh] - 10https://gerrit.wikimedia.org/r/162963 [19:55:18] jgage: ^ [19:55:35] greg-g, Jeff_Green, tonythomas: outbound email from beta would be used for account email confirmation and password recovery. Probably not much else. [19:55:53] Maybe echo stuff too I guess [19:55:56] * greg-g nods [19:56:14] I don't think it's relied upon for testing, for instance, but it would suck if password resets weren't working [19:59:23] bd808: tony's off for the night, but what we're looking for is a way to meaningfully test the whole mail path [20:00:05] i.e. to test the mx (equivalent to production's polonium) itself, including inbound mail if only from a labs test server [20:00:18] Jeff_Green: Sure. I think it would be fine to leave for a day or two for testing that out. Just don't want to lose beta email forever. [20:00:36] Jeff_Green: a word of caution/plea: please don't make it so people can't reset their passwords (for instance). If we need to wait until the second cluster is operational we should. [20:00:50] right. 
i guess the question is whether it makes sense to do this in beta at all [20:01:14] he can do essentially the same testing in the verp labs project [20:02:01] gotcha [20:12:05] PROBLEM - NTP on dbstore1001 is CRITICAL: NTP CRITICAL: Offset unknown [20:13:24] ^ accidental fallout of the resync stuff, it wil come back eventually [20:13:25] (03PS1) 10Prtksxna: Flow enable mw:Talk:MediaWiki UI (fix typo) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162965 (https://bugzilla.wikimedia.org/71204) [20:14:15] PROBLEM - NTP on wtp1020 is CRITICAL: NTP CRITICAL: Offset unknown [20:15:05] RECOVERY - NTP on dbstore1001 is OK: NTP OK: Offset -0.0004177093506 secs [20:17:08] (03PS1) 10Yuvipanda: icinga: Move global monitoring hostgroups into module [puppet] - 10https://gerrit.wikimedia.org/r/162966 [20:17:10] (03PS1) 10Yuvipanda: nagios_common: Move check_ganglia into module [puppet] - 10https://gerrit.wikimedia.org/r/162967 [20:17:17] RECOVERY - NTP on wtp1020 is OK: NTP OK: Offset 7.450580597e-05 secs [20:24:00] db1066.eqiad.wmnet looks sick. There are 168 errors in logstash in the last hour mentioning it (10.64.48.21). They seem to mostly be about lost connecitons. 
[20:28:03] (03PS1) 10Yuvipanda: icinga: Remove ganglios checks [puppet] - 10https://gerrit.wikimedia.org/r/162969 [20:28:08] bd808: huge load seemingly just after deploy [20:28:09] https://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&c=MySQL+eqiad&h=db1066.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=small&metric_group=ALLGROUPS [20:28:34] enwiki api server [20:29:08] yikes [20:29:24] It has a load of [20:29:25] "SELECT /* ApiQueryAllUsers::execute Cyberbot I */" [20:29:41] lol [20:29:53] !log repooled mw1051 [20:35:25] PROBLEM - puppet last run on amssq31 is CRITICAL: CRITICAL: puppet fail [20:36:37] !log manually migrated "NickK" to a global account [20:36:56] !log no !log [20:37:19] it's logging to SAL, just morebots isn't confirming [20:37:28] ah [20:39:11] (03PS2) 10Milimetric: Configure archive table name [puppet] - 10https://gerrit.wikimedia.org/r/162293 [20:42:36] (03PS2) 10Milimetric: Make archive table name configurable [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/162292 [20:43:48] (03PS3) 10Milimetric: Configure archive table name [puppet] - 10https://gerrit.wikimedia.org/r/162293 [20:44:27] (03CR) 10jenkins-bot: [V: 04-1] Configure archive table name [puppet] - 10https://gerrit.wikimedia.org/r/162293 (owner: 10Milimetric) [20:45:06] ottomata: Maybe I'm doing this wrong [20:45:37] I'm updating the wikimetrics submodule to an un-merged change from gerrit [20:45:57] i figured since the SHA of the commit will stay the same, this saved time [20:46:12] but it's giving it -1 [20:46:47] FATAL: Command "submodule update" returned status code 1: [20:46:48] 20:44:01 stdout: [20:46:48] 20:44:01 stderr: fatal: reference is not a tree: b886ca331045432afe76882f184c0cdd4a077f3d [20:46:51] that's jenkins [20:46:56] trying to run git submodule update [20:46:58] https://integration.wikimedia.org/ci/job/operations-puppet-typos/21235/console [20:47:02] oooh, ok [20:47:03] but since the commit hasn't been merged on teh submodule [20:47:06] its all un happy [20:47:16] 
so is this correct and I just have to wait for the merge? [20:47:23] ja that should be ok [20:47:36] ok then, this and the related change are ready for you to review [20:47:36] shall I merge that? [20:47:38] k [20:47:44] (03CR) 10Ottomata: [C: 032 V: 032] Make archive table name configurable [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/162292 (owner: 10Milimetric) [20:47:58] (03CR) 10Ottomata: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/162293 (owner: 10Milimetric) [20:48:07] says saying 'recheck' in a comment make jenkins recheck it? [20:48:17] :) [20:48:28] does saying? [20:48:30] that wa sa question [20:48:43] yes [20:49:02] ah cool [20:49:02] worked [20:49:17] (03CR) 10Ottomata: [C: 032 V: 032] Configure archive table name [puppet] - 10https://gerrit.wikimedia.org/r/162293 (owner: 10Milimetric) [20:49:57] thanks ottomata! [20:54:26] RECOVERY - puppet last run on amssq31 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:54:29] I'm seeing an l10n cache issue in media viewer, can the next person to deploy please scap? [20:55:02] marktraceur: Any details that may tell us why it broke? [20:55:30] Not really. [20:55:35] I see the right message value in the API [20:55:43] And I know it's using the right message 'cause of uselang=qqx [20:56:38] new message or one that has been around for a while? [20:57:27] It's been around, its value changed [20:59:06] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [20:59:06] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [20:59:43] (03CR) 10Gage: [C: 031] "discussed in IRC. FUSE is weird!" 
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/162963 (owner: 10Ottomata) [21:02:29] (03CR) 10Ottomata: [C: 032 V: 032] Use nasty exec hack to avoid mount -> file dependency issues [puppet/cdh] - 10https://gerrit.wikimedia.org/r/162963 (owner: 10Ottomata) [21:03:21] (03PS2) 10Ottomata: Mount HDFS at /mnt/hdfs read only on role::analytics::clients (stat1002 and analytics1027) [puppet] - 10https://gerrit.wikimedia.org/r/162930 [21:04:46] (03PS3) 10Ottomata: Mount HDFS at /mnt/hdfs read only on role::analytics::clients (stat1002 and analytics1027) [puppet] - 10https://gerrit.wikimedia.org/r/162930 [21:04:53] (03CR) 10Ottomata: [C: 032 V: 032] Mount HDFS at /mnt/hdfs read only on role::analytics::clients (stat1002 and analytics1027) [puppet] - 10https://gerrit.wikimedia.org/r/162930 (owner: 10Ottomata) [21:06:15] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [21:06:15] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [21:16:05] PROBLEM - Disk space on analytics1027 is CRITICAL: DISK CRITICAL - /mnt/hdfs is not accessible: Transport endpoint is not connected [21:17:13] hmm [21:24:45] * SadPanda gives ^d a vague ping about ES metrics in graphite [21:32:36] PROBLEM - puppet last run on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:34:26] PROBLEM - DPKG on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:34:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:34:36] PROBLEM - Disk space on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:34:45] PROBLEM - check if dhclient is running on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:34:45] PROBLEM - check configured eth on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:34:55] PROBLEM - SSH on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:34:56] PROBLEM - nutcracker port on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:34:56] PROBLEM - nutcracker process on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:35:30] looks pretty dead [21:35:48] it's resting [21:37:34] guess that means no more hhvm jobrunner [21:41:03] !log powercycling mw1053 [21:41:09] Logged the message, Master [21:43:06] RECOVERY - Disk space on analytics1027 is OK: DISK OK [21:43:26] RECOVERY - DPKG on mw1053 is OK: All packages OK [21:43:45] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [21:43:45] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 1979 seconds ago with 0 failures [21:43:45] RECOVERY - Disk space on mw1053 is OK: DISK OK [21:43:46] RECOVERY - check configured eth on mw1053 is OK: NRPE: Unable to read output [21:43:46] RECOVERY - check if dhclient is running on mw1053 is OK: PROCS OK: 0 processes with command name dhclient [21:43:55] RECOVERY - SSH on mw1053 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [21:43:56] RECOVERY - nutcracker port on mw1053 is OK: TCP OK - 0.000 second response time on port 11212 [21:44:07] RECOVERY - nutcracker process on mw1053 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [21:45:39] mutante: thanks [21:45:49] mutante: was it completely unresponsive? [21:46:23] ori: on mgmt it was repeating the "Ubuntu .. login:" message over and over.. without me hitting a key.. and i couldn't get in [21:46:42] ssh was unresponsive, yea [21:47:16] <_joe_> if you waited 10 minutes, it would've come back :) [21:47:31] <_joe_> I know my little beloved jobrunner [21:47:49] lol [21:53:15] PROBLEM - Disk space on analytics1027 is CRITICAL: DISK CRITICAL - /mnt/hdfs is not accessible: Transport endpoint is not connected [21:53:25] another issue, this finds 0 , but still ok .. 
PROCS OK: 0 processes with command name 'dhclient' [21:55:18] mutante: isn't that the point? [21:55:25] If it's still running it's a problem? [21:55:56] Reedy: yea, you're right [21:59:01] (03PS1) 10Ottomata: Ensure icedtea-7-jre-jamvm is absent on analytics::clients [puppet] - 10https://gerrit.wikimedia.org/r/163038 [22:00:22] (03PS2) 10Ottomata: Ensure icedtea-7-jre-jamvm is absent on analytics::clients [puppet] - 10https://gerrit.wikimedia.org/r/163038 [22:00:29] greg-g: We've determined that the MMV message issue can be solved by refreshMessageBlobs.php, but tgr says it takes a while - should I tell a SWATter to do it or try to do it this hour? [22:01:02] cc bd808, legoktm. [22:01:09] probably best to do it out of swat [22:01:14] Depending on how much of an issue it is... [22:01:20] l10nupdate is going to run in a few hours [22:01:26] That'll run refreshMessageBlobs.php [22:01:26] That's probably fine [22:01:52] that too, wasn't sure if it was BLOCKER/IMMEDIATE [22:02:11] Asking product if it's a big issue. I suspect not. [22:02:40] There are wrong words on a wiki! [22:02:50] omg [22:02:53] Stop everything [22:03:13] (03PS3) 10Ottomata: Ensure icedtea-7-jre-jamvm is absent on analytics::clients [puppet] - 10https://gerrit.wikimedia.org/r/163038 [22:07:35] (03CR) 10Gage: [C: 031] Ensure icedtea-7-jre-jamvm is absent on analytics::clients [puppet] - 10https://gerrit.wikimedia.org/r/163038 (owner: 10Ottomata) [22:09:12] (03CR) 10Ottomata: [C: 032 V: 032] Ensure icedtea-7-jre-jamvm is absent on analytics::clients [puppet] - 10https://gerrit.wikimedia.org/r/163038 (owner: 10Ottomata) [22:09:37] why no jenkins? 
[22:10:17] RECOVERY - Disk space on analytics1027 is OK: DISK OK [22:13:51] (03CR) 10Dzahn: "broke wikidata monitoring it seems:" [puppet] - 10https://gerrit.wikimedia.org/r/161939 (owner: 10Yuvipanda) [22:14:21] something went wrong when moving the wikidata monitoring it looks [22:14:25] -bash: /usr/local/lib/nagios/plugins/check_wikidata: No such file or directory [22:20:59] SadPanda: ^^ [22:21:32] * SadPanda checks [22:21:52] <^d> SadPanda: ack'd. sorry for not getting to it. [22:21:59] <^demon|sick> ^ more accurate [22:22:07] ^demon|sick: awww! stay off computer, etc [22:22:18] ^demon|sick: getting rid of ganglia based checks in prod would also be nice :) [22:22:58] <^demon|sick> agreed :) [22:23:02] mutante: hmm, it's looking for it in the wrong path [22:23:05] RT #8456 [22:23:07] i quit #wikidata again [22:23:17] ^demon|sick: checks are in /usr/lib/nagios/plugins [22:23:17] SadPanda: ^ [22:23:47] mutante: haha! found it [22:24:30] patch incoming [22:25:45] (03PS1) 10Yuvipanda: nagios_common: Use macro to expand path to wikidata check [puppet] - 10https://gerrit.wikimedia.org/r/163045 [22:26:02] mutante: ^ can you merge? [22:26:58] mutante: none of the others seem to have a different path [22:27:09] andrewbogott: ^ (if you're still around) [22:27:51] (03CR) 10Dzahn: [C: 032] nagios_common: Use macro to expand path to wikidata check [puppet] - 10https://gerrit.wikimedia.org/r/163045 (owner: 10Yuvipanda) [22:27:58] mutante: w00t, ty [22:28:05] (03CR) 10Dzahn: "@neon:/etc/icinga# /usr/lib/nagios/plugins/check_wikidata" [puppet] - 10https://gerrit.wikimedia.org/r/163045 (owner: 10Yuvipanda) [22:28:24] mutante: yay [22:28:41] SadPanda: the difference between those path is.. comes from .deb .. 
or is installed by us [22:28:47] mutante: installed by us [22:28:47] but i'm sure it's all mixed up [22:28:54] mutante: for some reason it was installed by us in a different path [22:28:55] yes, i know [22:29:05] installed by us used to mean goes to /usr/local/ [22:29:10] mutante: right [22:29:16] but all of them now just go to /usr/lib/nagios/plugins [22:29:19] and had for a while :( [22:29:23] we should probably change it at some point [22:29:26] but i'm sure it's all mixed up [22:29:32] also I think some of our puppet config is just stuff that's also from some other deb [22:29:33] so all others were wrong :) [22:29:35] that we for some reason puppetized [22:29:45] mutante: well, some were wrong and some aren't, but we don't know which :) [22:29:59] wikidata was right.. before [22:30:00] for example, I'm pretty sure ftp, telnet, etc came by default with some deb and we just puppetized them [22:30:29] probably with nagios-plugins [22:30:39] yeah [22:30:45] Is jenkins stuck? [22:30:53] there are a bunch of them.. nagios-plugins-basic ... nagios-plugins-extra .. 
[22:32:42] SadPanda: puppet$ grep -r "local/lib/nagios" *
[22:32:58] there are a bunch more in local, it wasn't just this one
[22:33:39] mutante: yeah
[22:33:56] * SadPanda wonders how to handle the rest of 'em
[22:34:09] I suppose I could wholesale move them to /usr/local
[22:35:02] but that doesn't seem right, because some of these are default
[22:35:07] but I guess we're puppet installing them
[22:35:10] SadPanda: here, but looks like Daniel is on top of it
[22:35:11] so they should be in /usr/local
[22:35:32] technically the ones we install from puppet should be in /usr/local
[22:35:33] andrewbogott: yeah :) do respond to my thoughts in -labs about generate / hosts if you've time
[22:35:37] mutante: hmm, I agree
[22:35:49] mutante: I'll move things around tomorrow, I think
[22:36:05] mutante: we'd leave the machine itself in a reasonably fucked up state, tho :(
[22:36:13] with lots of crap in /usr/lib that isn't being used
[22:36:36] oh well
[22:36:37] check_dpkg check_eth and check_puppetrun are left
[22:36:49] so i think they will still work
[22:36:58] looks for one of them in web ui
[22:37:27] SadPanda: yea, for now i just want it to not break
[22:37:41] mutante: yeah, I haven't touched any of those
[22:37:54] mutante: I did a grep for the ones I had touched, and I don't see any of them with code in local
[22:38:26] mutante: do you have rights to reschedule the wikidata check?
[22:38:34] from https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=wikidata&service=check+if+wikidata.org+dispatch+lag+is+higher+than+2+minutes
[22:38:42] yes
[22:38:49] mutante: can you?
[22:39:38] done
[22:40:53] SadPanda: hmm.. wikidata does not appear in checkcommands.cfg
[22:41:15] mutante: indeed, commands have been modularized as well. nagios_common/files/check_commands/check_wikidata.cfg
[22:41:25] along with check_wikidata the script in the same folder
[22:42:49] SadPanda: where does it end up on the server?
[22:42:56] mutante: /etc/icinga/commands
[22:43:19] (PS1) Kaldari: Turn off WikiGrok experiment pending fix for Bug 71335 [mediawiki-config] - https://gerrit.wikimedia.org/r/163048
[22:43:46] SadPanda: ah! i see it.. yea. but puppet did not change that yet
[22:44:03] it still points to the ./local/
[22:44:15] runs puppet again
[22:44:27] oh, that's weird
[22:44:45] (PS5) Catrope: Followup 6084646d: apply Mathoid directory creation hack to labs too [puppet] - https://gerrit.wikimedia.org/r/162811
[22:47:35] SadPanda: yea.. hmm.. finished catalog run but:
[22:47:46] grep local check_wikidata.cfg command_line /usr/local/lib/nagios/plugins/check_wikidata
[22:47:53] reopened the ticket
[22:48:03] mutante: uh oh. can you paste puppetlog?
[22:48:16] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: puppet fail
[22:48:51] SadPanda: http://paste.debian.net/123160/
[22:49:41] mutante: hmm, I'm not fully sure what's happening. can you rm the current file, re-run puppet, see if it puts it back?
[22:49:57] why is jenkins so far backlogged? i've never seen it this ba
[22:49:59] bad*
[22:50:10] (CR) Hoo man: [C: 1] Removes hardcoded list of linked wikis in the "other projects" sidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/157062 (https://bugzilla.wikimedia.org/70169) (owner: Tpt)
[22:51:08] SadPanda: you mean the .cfg file, right
[22:51:16] mutante: yea
[22:52:28] ok
[22:55:31] SadPanda: /Nagios_common::Check_command::Config[check_wikidata]/File[/etc/icinga/commands/check_wikidata.cfg]/ensure: created
[22:55:40] Error: /Stage[main]/Icinga::Monitor::Service/Service[icinga]: Failed to call refresh: Could not restart Service[icinga]: Execution of '/etc/init.d/icinga reload' returned 6:
[22:55:43] mutante: does it have the proper path?
[22:55:44] bah
[22:55:47] that broke it
[22:56:09] Error: Service check command 'check_wikidata' specified in service 'check if wikidata.org dispatch lag is higher than 2 minutes' for host 'wikidata' not defined anywhere!
[22:56:20] mutante: what's in the check_wikidata.cfg file?
[22:56:38] SadPanda: nothing :/
[22:56:43] check_wikidata.cfg: empty
[22:56:43] wat
[22:56:45] * SadPanda is confused
[22:57:16] goddamit
[22:57:19] * SadPanda is an utter idiot
[22:57:53] (PS1) Yuvipanda: nagios_common: Fix stupid copy paste error [puppet] - https://gerrit.wikimedia.org/r/163053
[22:57:54] mutante: ^
[22:57:59] !log ori Synchronized php-1.24wmf22/extensions/Wikidata: Update Wikidata for I0acd2096d21b (duration: 00m 11s)
[22:58:04] Logged the message, Master
[22:58:24] hmm. that changes all check commands?
[22:58:50] mutante: yes, so I think everything *before* that check command worked, and things after didn't
[22:59:04] I suppose the only reason icinga didn't complain was that all the things I moved after implementing that command aren't used anywhere?
[22:59:28] by the way, I'd like to SWAT
[22:59:30] i don't know, but touching all commands now ...
[22:59:38] MaxSem: i'm done, go ahead
[22:59:53] mutante: most of them would be empty files now, I think
[23:00:02] mutante: it was clearly a puppet syntax error
[23:00:04] RoanKattouw, ^d, marktraceur, MaxSem, ebernhardson: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140925T2300). Please do the needful.
[23:00:11] mutante: if you look at the patch, $config_source wasn't defined anywhere
[23:00:12] SadPanda: yes, they are almost all empty..
[23:00:20] mutante: yeah, and this should fix it
[23:00:27] wow..
how did it not break earlier
[23:00:36] indeed, I've no idea
[23:01:42] (CR) MaxSem: [C: 2] Turn off WikiGrok experiment pending fix for Bug 71335 [mediawiki-config] - https://gerrit.wikimedia.org/r/163048 (owner: Kaldari)
[23:01:49] (Merged) jenkins-bot: Turn off WikiGrok experiment pending fix for Bug 71335 [mediawiki-config] - https://gerrit.wikimedia.org/r/163048 (owner: Kaldari)
[23:02:25] (PS1) BryanDavis: beta: Remove Apache::Conf['hhvm_catchall'] from mediawiki::web::beta_sites [puppet] - https://gerrit.wikimedia.org/r/163054
[23:02:52] !log maxsem Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/163048 (duration: 00m 03s)
[23:02:57] Logged the message, Master
[23:02:58] kaldari, ^^^
[23:03:23] (CR) Dzahn: [C: 2] "root@neon:/etc/icinga/commands# file *" [puppet] - https://gerrit.wikimedia.org/r/163053 (owner: Yuvipanda)
[23:04:42] (CR) BryanDavis: "Cherry-picked to deployment-salt to replace local commit." [puppet] - https://gerrit.wikimedia.org/r/163054 (owner: BryanDavis)
[23:06:36] (PS3) BBlack: Move remaining traffic back to ulsfo [dns] - https://gerrit.wikimedia.org/r/162752
[23:07:03] (CR) BBlack: [C: 2 V: 2] Move remaining traffic back to ulsfo [dns] - https://gerrit.wikimedia.org/r/162752 (owner: BBlack)
[23:07:35] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[23:08:18] !log maxsem Synchronized php-1.24wmf22/includes/api/ApiQueryAllUsers.php: https://gerrit.wikimedia.org/r/#/c/163026/ (duration: 00m 03s)
[23:08:22] GB => esams, # United Kingdom
[23:08:24] Logged the message, Master
[23:08:25] ok, who wrote this.
[23:09:00] Krenair: ?
[23:09:22] that line I posted from the config-geo file in the DNS repo
[23:09:26] Warning: Duplicate definition found for command 'check_to_check_nagios_paging' (config file '/etc/icinga/commands/check_to_check_nagios_paging.cfg', starting on line 2)
[23:09:30] Error: Could not add object property in file '/etc/icinga/commands/check_to_check_nagios_paging.cfg' on line 3.
[23:09:32] What's wrong with it?
[23:09:40] GB # United Kingdom
[23:09:47] errr, # Scotland ? ;P
[23:09:48] SadPanda: now this ^
[23:09:54] Krenair: Semantics
[23:10:00] mutante: looking
[23:10:10] bah, andrewbogott_afk is afk now
[23:10:22] yep
[23:10:23] !log maxsem Synchronized php-1.25wmf1/includes/api/ApiQueryAllUsers.php: https://gerrit.wikimedia.org/r/#/c/163027/ (duration: 00m 03s)
[23:10:28] Logged the message, Master
[23:10:51] Reedy, API fix to both branches deployed:)
[23:11:12] * Reedy will look in ganglia in a few minutes after old queries have "finished"
[23:11:44] Krenair: the ISO sets that stuff up, not me
[23:11:46] http://en.wikipedia.org/wiki/ISO_3166-2:GB
[23:11:57] Krenair: Presumably, it's GEOIP "at fault" for using GB
[23:12:07] mutante: is that the only error?
[23:12:47] !log maxsem Synchronized php-1.25wmf1/includes/resourceloader/ResourceLoaderSiteModule.php: https://gerrit.wikimedia.org/r/#/c/163024/ (duration: 00m 03s)
[23:12:49] no, the ISO is at fault. MaxMind, and thus gdnsd, and thus our config files, are all following in line with ISO
[23:12:52] Logged the message, Master
[23:12:57] (PS1) Yuvipanda: nagios_common: Remove duplicate paging definition [puppet] - https://gerrit.wikimedia.org/r/163058
[23:13:07] SadPanda: i'm not sure, because it doesn't show me number of WARNS and ERRORS now
[23:13:16] mutante: icinga -v /etc/icinga/icinga.cfg?
[23:13:17] it's like it fails before getting there
[23:13:21] yes, of course
[23:13:36] bblack: I was meaning it along the lines of it being "not you"
[23:13:42] :)
[23:13:47] SadPanda: "One or more problems"
[23:14:03] mutante: I'm looking for duplicates now, submitting patches as I go
[23:14:13] mutante: pushed one ^
[23:14:32] YuviPanda: i laughed at it saying main configuration file is typically '/usr/local/icinga/etc/icinga.cfg'
[23:14:36] ok
[23:15:31] !log maxsem Synchronized php-1.25wmf1/extensions/CentralAuth/: https://gerrit.wikimedia.org/r/#/c/162971/ (duration: 00m 04s)
[23:15:35] Logged the message, Master
[23:16:01] https://ganglia.wikimedia.org/latest/?c=MySQL%20eqiad&h=db1051.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[23:16:09] https://ganglia.wikimedia.org/latest/?c=MySQL%20eqiad&h=db1066.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[23:16:12] Big drop off is big
[23:16:33] (CR) Dzahn: [C: 2] "Error: Could not add object property in file '/etc/icinga/commands/check_to_check_nagios_paging.cfg' on line 3." [puppet] - https://gerrit.wikimedia.org/r/163058 (owner: Yuvipanda)
[23:17:45] mutante: hmm, I think that should do it
[23:17:51] let me verify anyway
[23:22:17] mutante: how's the run going?
[23:23:25] (PS1) BryanDavis: more deps for beta cluster jobrunners [puppet] - https://gerrit.wikimedia.org/r/163059
[23:23:32] YuviPanda: caching..
[23:24:05] mutante: this was also another silly error on my part, didn't remove the config from checkcommands when I moved it
[23:24:33] Reedy: heh, nice graph
[23:24:58] probably that whole user_name index was in memory and being scanned...eating up cpu
[23:25:27] pesky "group in X" queries
[23:26:48] YuviPanda: looks good now :)
[23:26:57] mutante: yay :)
[23:27:00] Service[icinga]: Triggered 'refresh'
[23:27:17] that wasn't too bad. just two follow up patches, all due to silly errors.
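The two failure modes hit above (command .cfg files rendered empty, and a command defined in two places) can be checked for mechanically before reloading icinga. A minimal sketch in Python, assuming the command definitions use the standard Nagios/Icinga `command_name` directive; the function name `audit_command_configs` is hypothetical and is not something used in this log.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches "command_name <name>" inside a Nagios/Icinga "define command" block.
COMMAND_NAME = re.compile(r"^\s*command_name\s+(\S+)", re.MULTILINE)


def audit_command_configs(paths):
    """Return (empty_files, duplicates) for the given config files.

    empty_files: file names that contain no non-whitespace content.
    duplicates: command_name -> sorted file names, only for names
    that are defined more than once across all files.
    """
    empty_files = []
    seen = defaultdict(list)
    for path in sorted(Path(p) for p in paths):
        text = path.read_text()
        if not text.strip():
            # An empty file defines nothing, so any service that
            # references its command will fail validation.
            empty_files.append(path.name)
            continue
        for name in COMMAND_NAME.findall(text):
            seen[name].append(path.name)
    duplicates = {
        name: sorted(files) for name, files in seen.items() if len(files) > 1
    }
    return empty_files, duplicates
```

It could be pointed at the per-command files and the legacy file together, for example `audit_command_configs(Path("/etc/icinga/commands").glob("*.cfg"))`; `icinga -v /etc/icinga/icinga.cfg` remains the authoritative validation.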
[23:27:22] * YuviPanda will review them more carefully next time
[23:28:53] (PS2) BryanDavis: beta: more deps for beta cluster jobrunners [puppet] - https://gerrit.wikimedia.org/r/163059
[23:29:17] YuviPanda: one single file is still empty. it's notifycommands.cfg
[23:30:42] YuviPanda: the wikidata check now uses $USER1$.. rescheduling
[23:30:57] (CR) BryanDavis: "Cherry-picked to deployment-salt to replace local commit." [puppet] - https://gerrit.wikimedia.org/r/163059 (owner: BryanDavis)
[23:31:14] !log maxsem Synchronized php-1.25wmf1/skins/Vector/: https://gerrit.wikimedia.org/r/#/c/163021/ (duration: 00m 03s)
[23:31:20] Logged the message, Master
[23:31:33] legoktm / duh, that's all - please test:)
[23:31:43] * legoktm does
[23:32:10] logo stuff looks good
[23:32:14] mutante: Couple of easy beta-only puppet changes if you have time -- https://gerrit.wikimedia.org/r/#/c/163059/ https://gerrit.wikimedia.org/r/#/c/163054/
[23:32:18] will test the CA one in a minute
[23:32:30] i don't, sorry
[23:32:37] no worries
[23:32:40] already over an hour late, fixing icinga
[23:32:50] darn that YuviPanda
[23:33:00] mutante: you can rm that, should be ok empty as well
[23:33:02] * YuviPanda hangs head in shame
[23:33:32] YuviPanda: sometimes we make things worse on the way to making them better :)
[23:33:35] checking ssl cert thing now
[23:34:25] bd808: :) there was one problem, which was masked by another problem, but all seem ok now
[23:35:12] wikidata check if wikidata.org dispatch lag is higher than 2 minutes OK
[23:35:35] mutante: yay
[23:35:38] mutante: thanks a lot
[23:36:38] MaxSem: confirmed the CA change. thanks!
[23:36:46] whee:)
[23:41:01] (CR) BryanDavis: "Bump.
I agree with Mark that adding to production puppetmasters would be premature, but this patch only creates the proper config and appl" [puppet] - https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) (owner: BryanDavis)
[23:52:28] (CR) BryanDavis: "I think this is working around bugs in the puppet manifest more than bugs in Trebuchet. I may be wrong though." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/162811 (owner: Catrope)
[23:53:58] (CR) Dzahn: [C: 2] beta: more deps for beta cluster jobrunners [puppet] - https://gerrit.wikimedia.org/r/163059 (owner: BryanDavis)
[23:54:38] (CR) Dzahn: [C: 2] beta: Remove Apache::Conf['hhvm_catchall'] from mediawiki::web::beta_sites [puppet] - https://gerrit.wikimedia.org/r/163054 (owner: BryanDavis)
[23:55:28] (CR) Dzahn: "thanks for tracking down the puppet failures" [puppet] - https://gerrit.wikimedia.org/r/163054 (owner: BryanDavis)
[23:58:05] Thanks for the merges mutante. Beta is down to 5 local commits and 2 of those are things that we will never upstream to the main repo. :)