[00:01:26] (03CR) 10BBlack: [C: 032] varnish does not like duplicate names for a director + backend [puppet] - 10https://gerrit.wikimedia.org/r/206022 (owner: 10BBlack)
[00:01:34] (03CR) 10BryanDavis: "Cherry-picked and applied on beta cluster." [puppet] - 10https://gerrit.wikimedia.org/r/205968 (owner: 10BryanDavis)
[00:01:45] Krenair: Cool, just let me know when you’re done then
[00:02:06] Waiting for Jenkins again unfortunately.
[00:02:17] To be fair it's faster than it used to be
[00:03:30] (03Merged) 10jenkins-bot: Revert "Add editcontentmodel on testwiki temporarily for sysop" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206024 (owner: 10Mattflaschen)
[00:03:57] (03PS2) 10BryanDavis: logstash: Convert $::realm switches to hiera [puppet] - 10https://gerrit.wikimedia.org/r/205969
[00:04:36] PROBLEM - puppet last run on cp1055 is CRITICAL Puppet has 1 failures
[00:05:46] !log krenair Synchronized php-1.26wmf2/extensions/ZeroBanner/includes/ZeroSpecialPage.php: https://gerrit.wikimedia.org/r/#/c/206023/ (duration: 00m 13s)
[00:05:51] Logged the message, Master
[00:07:21] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/206024 (duration: 00m 14s)
[00:07:22] superm401, kaldari done
[00:07:25] Logged the message, Master
[00:07:26] (03PS1) 10Dzahn: allow role-based hiera lookup on terbium [puppet] - 10https://gerrit.wikimedia.org/r/206025
[00:07:39] Thanks
[00:08:28] (03PS2) 10Dzahn: allow role-based hiera lookup on terbium [puppet] - 10https://gerrit.wikimedia.org/r/206025
[00:08:34] 6operations, 10Traffic: Reboot caches for kernel 3.19.3 globally - https://phabricator.wikimedia.org/T96854#1229837 (10BBlack) I'd like to get the varnish 3.0.7 package built and ready as well, to coalesce the service restart with the reboot, basically (see blocking task). Also, the caches don't have the new...
[00:10:31] thanks
[00:10:54] (03PS1) 10Dzahn: allow role-based hiera lookup on tin [puppet] - 10https://gerrit.wikimedia.org/r/206026
[00:12:16] (03CR) 10Kaldari: [C: 032] Turning on WikiGrok on English Wikipedia (for 2 week test) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206009 (owner: 10Kaldari)
[00:12:21] (03Merged) 10jenkins-bot: Turning on WikiGrok on English Wikipedia (for 2 week test) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206009 (owner: 10Kaldari)
[00:14:12] (03PS1) 10Dzahn: antimony: include both roles with role keyword [puppet] - 10https://gerrit.wikimedia.org/r/206027
[00:14:18] (03PS3) 10BryanDavis: logstash: Convert $::realm switches to hiera [puppet] - 10https://gerrit.wikimedia.org/r/205969
[00:14:25] (Btw, what I said earlier about OSM and wmf3 was wrong. It made it before the branch cut)
[00:15:15] !log kaldari Synchronized wmf-config/InitialiseSettings.php: Turning on WikiGrok on English Wikipedia (for 2 week test) (duration: 00m 11s)
[00:15:18] Logged the message, Master
[00:15:44] (03PS1) 10Dzahn: argon: use role keyword for hiera lookup [puppet] - 10https://gerrit.wikimedia.org/r/206028
[00:16:17] I’m all done
[00:17:35] (03PS1) 10Dzahn: bast4001: use role keyword for all roles [puppet] - 10https://gerrit.wikimedia.org/r/206029
[00:18:30] (03CR) 10Dereckson: [C: 031] Change project name to 'Wikipedia' at astwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201897 (https://phabricator.wikimedia.org/T94341) (owner: 10Glaisher)
[00:21:16] (03PS1) 10Kaldari: Revert "Turning on WikiGrok on English Wikipedia (for 2 week test)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206030
[00:22:15] (03PS1) 10Dzahn: horizon: use role keyword on californium [puppet] - 10https://gerrit.wikimedia.org/r/206031
[00:23:09] hmm, looks like the new tables were never initialized on en.wiki. reverting for now :(
[00:23:09] kaldari, couldn't you just create the tables?
[00:23:26] Krenair: I’ve never created tables on prod
[00:23:54] https://wikitech.wikimedia.org/wiki/How_to_do_a_schema_change
[00:24:05] RECOVERY - puppet last run on cp1055 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[00:24:37] (03PS1) 10Dzahn: carbon: use role keyword to include installserver [puppet] - 10https://gerrit.wikimedia.org/r/206032
[00:24:50] Krenair: OK, looks easy enough
[00:25:23] yep
[00:26:01] Krenair: wanna hang out for a minute while I try this :)
[00:26:57] I can't do a call right now if that's what you mean, sorry
[00:27:57] no problem. So I’m going to first run ‘mwscript sql.php --wiki=enwiki php-1.26wmf3/extensions/WikiGrok/sql/wikigrok_responses.sql’ from /srv/mediawiki-staging. Does that sound right?
[00:28:10] (03PS1) 10Dzahn: eventlog1001: include all roles with role keyword [puppet] - 10https://gerrit.wikimedia.org/r/206036
[00:28:37] just want a sanity check before I pull the trigger :)
[00:28:46] (03CR) 10BryanDavis: "Cherry-picked and applied on beta cluster." [puppet] - 10https://gerrit.wikimedia.org/r/205969 (owner: 10BryanDavis)
[00:29:01] If that's the right file, it looks OK
[00:29:37] OK, I’ll check on terbium now
[00:30:10] Krenair: cool, looks like it worked :)
[00:30:37] :)
[00:32:18] (03PS2) 10BryanDavis: logstash: Provision Elasticsearch only backend hosts [puppet] - 10https://gerrit.wikimedia.org/r/205970 (https://phabricator.wikimedia.org/T96814)
[00:33:14] (03CR) 10Dereckson: [C: 04-1] Set $wgRateLimits['badcaptcha'] to counter bots (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) (owner: 10Nemo bis)
[00:34:31] (03PS3) 10BryanDavis: logstash: Provision Elasticsearch only backend hosts [puppet] - 10https://gerrit.wikimedia.org/r/205970 (https://phabricator.wikimedia.org/T96814)
[00:36:59] (03PS1) 10Dzahn: gallium: use role keyword to include roles [puppet] - 10https://gerrit.wikimedia.org/r/206038
[00:43:25] (03PS1) 10Dzahn: role::package::builder -> ganglia_new hiera [puppet] - 10https://gerrit.wikimedia.org/r/206039
[00:43:57] (03PS2) 10Dzahn: role::package::builder -> ganglia_new hiera [puppet] - 10https://gerrit.wikimedia.org/r/206039 (https://phabricator.wikimedia.org/T93776)
[00:44:18] (03CR) 10Tim Landscheidt: [C: 04-1] "As said, in principle this moves in the right direction. But toollabs::mailrelay includes toollabs which overwrites /etc/exim4/exim4.conf" [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen)
[00:45:09] (03CR) 10Dzahn: [C: 032] role::package::builder -> ganglia_new hiera [puppet] - 10https://gerrit.wikimedia.org/r/206039 (https://phabricator.wikimedia.org/T93776) (owner: 10Dzahn)
[00:45:24] (03PS2) 10BryanDavis: logstash: Convert Elasticsearch on logstash100[1-3] to client [puppet] - 10https://gerrit.wikimedia.org/r/205971 (https://phabricator.wikimedia.org/T96814)
[00:45:46] (03CR) 10BryanDavis: [C: 04-1] "This needs to wait until we have attached logstash100[4-6] to the Elasticsearch cluster and moved all of the shards over." [puppet] - 10https://gerrit.wikimedia.org/r/205971 (https://phabricator.wikimedia.org/T96814) (owner: 10BryanDavis)
[00:46:11] mutante: haha, your last few commits look like: https://www.dropbox.com/s/v1o2p3znb5gezan/Screenshot%202015-04-22%2017.45.58.png?dl=0
[00:47:13] YuviPanda: there's a :package: for you at office
[00:47:20] :D
[00:47:24] am i doing it right
[00:47:27] yes you are :)
[00:47:32] also - is there really a package? :)
[00:47:36] no:)
[00:47:54] well, not that i know of
[00:48:26] so what else can it do ? :cow: :smile: :cat:
[00:49:09] :poop:
[00:50:14] hmm, the :package: change doesn't do what i expected it to do
[00:51:10] 6operations, 10Traffic: VCL support for Last-Access cookie - https://phabricator.wikimedia.org/T96861#1229917 (10Nuria) Nicee, thank you, adding analytics-kanban to this task!
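A minimal dry-run sketch of the schema-change sanity check discussed above (kaldari's `mwscript sql.php` invocation from /srv/mediawiki-staging). `mwscript` exists only on Wikimedia deployment/maintenance hosts, so this sketch only assembles and prints the command instead of executing it; the wiki name and SQL path are taken from the conversation, not re-verified:

```shell
#!/bin/sh
# Hypothetical sketch: build the sql.php maintenance command without running it.
WIKI=enwiki
SQL_FILE=php-1.26wmf3/extensions/WikiGrok/sql/wikigrok_responses.sql

CMD="mwscript sql.php --wiki=$WIKI $SQL_FILE"

# Print the command for a final eyeball check before pulling the trigger.
echo "$CMD"
```

On a real deployment host one would then run the printed command (optionally via `eval "$CMD"`) after confirming the SQL file matches the deployed branch.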
[00:51:16] expected role/common/package/builder.yaml = role package::builder
[00:51:34] 6operations, 6Analytics-Kanban, 10Traffic: VCL support for Last-Access cookie - https://phabricator.wikimedia.org/T96861#1229919 (10Nuria)
[01:05:53] (03PS1) 10Dzahn: helium: use role keyword to include roles [puppet] - 10https://gerrit.wikimedia.org/r/206045
[01:09:48] (03PS1) 10Dzahn: graphite nodes: use role keyword to include roles [puppet] - 10https://gerrit.wikimedia.org/r/206046
[01:12:54] (03PS1) 10Dzahn: ganglia: role::dumps -> ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/206047 (https://phabricator.wikimedia.org/T93776)
[01:15:38] (03CR) 10Dzahn: [C: 032] ganglia: role::dumps -> ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/206047 (https://phabricator.wikimedia.org/T93776) (owner: 10Dzahn)
[01:22:45] (03PS1) 10Dzahn: dataset1001 -> ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/206050
[01:23:15] (03CR) 10Dzahn: [C: 032] dataset1001 -> ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/206050 (owner: 10Dzahn)
[01:24:55] (03CR) 10Dzahn: "yea, this switches it, but the role-based approach didnt work yet. why?" [puppet] - 10https://gerrit.wikimedia.org/r/206050 (owner: 10Dzahn)
[01:27:16] (03CR) 10Dzahn: "@Giuseppe: Why does this approach to put it in the role not work, but host-based as in https://gerrit.wikimedia.org/r/#/c/206050/ does and" [puppet] - 10https://gerrit.wikimedia.org/r/206047 (https://phabricator.wikimedia.org/T93776) (owner: 10Dzahn)
[01:32:58] (03PS1) 10Dzahn: ms1001 -> ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/206052
[01:33:00] (03PS1) 10Dzahn: erbium -> ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/206053
[01:33:02] (03PS1) 10Dzahn: bast1001 -> ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/206054
[01:33:04] (03PS1) 10Dzahn: neon -> ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/206055
[01:33:06] (03PS1) 10Dzahn: silver -> ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/206056
[01:33:08] (03PS1) 10Dzahn: tin -> ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/206057
[01:33:10] (03PS1) 10Dzahn: berkelium -> ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/206058
[01:33:12] (03PS1) 10Dzahn: fluorine -> ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/206059
[01:33:14] (03PS1) 10Dzahn: chromium -> ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/206060
[01:33:16] (03PS1) 10Dzahn: curium -> ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/206061
[01:33:18] (03PS1) 10Dzahn: gadolinium -> ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/206062
[01:37:42] 6operations, 7Monitoring, 5Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1229990 (10Dzahn) switching over to ganglia_new per host: https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:ganglia-new,n,z in...
[02:21:32] 6operations, 6Analytics-Kanban, 10Traffic: VCL support for Last-Access cookie - https://phabricator.wikimedia.org/T96861#1230025 (10BBlack) 5Open>3Invalid Sorry, I didn't realize (or forgot?) you already had a ticket for the VCL work as well at T92435. Closing up this one and continuing to use that one...
[02:24:25] !log l10nupdate Synchronized php-1.26wmf2/cache/l10n: (no message) (duration: 05m 46s)
[02:24:34] Logged the message, Master
[02:28:42] !log LocalisationUpdate completed (1.26wmf2) at 2015-04-23 02:27:39+00:00
[02:28:46] Logged the message, Master
[02:30:59] (03PS10) 10BBlack: Adding a Last-Access cookie to text and mobile requests [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T88813) (owner: 10Nuria)
[02:46:28] !log l10nupdate Synchronized php-1.26wmf3/cache/l10n: (no message) (duration: 03m 46s)
[02:46:34] Logged the message, Master
[02:49:43] !log LocalisationUpdate completed (1.26wmf3) at 2015-04-23 02:48:40+00:00
[02:49:47] Logged the message, Master
[03:13:17] 6operations, 7HHVM: luasandbox is failing with HHVM 3.6 - https://phabricator.wikimedia.org/T96661#1230065 (10ori) 5Open>3Resolved a:3ori
[03:28:06] 7Puppet, 6operations, 10Continuous-Integration: Puppet (silently) fails to setup apache on new trusty instances - https://phabricator.wikimedia.org/T91832#1230073 (10Krinkle)
[03:28:44] 7Puppet, 10Browser-Tests: [Regression] QA: Puppet failing for Role::Ci::Slave::Browsertests/elasticsearch - https://phabricator.wikimedia.org/T74255#1230074 (10Krinkle)
[03:36:29] 7Puppet, 10Browser-Tests: [Regression] QA: Puppet failing for Role::Ci::Slave::Browsertests/elasticsearch - https://phabricator.wikimedia.org/T74255#1230094 (10Krinkle) 5Open>3Resolved a:3Krinkle Haven't seen this error in the 2 instance re-creation sprints. Works for me.
[04:06:54] (03CR) 10Mattflaschen: [C: 031] "This will enable a beta feature that (when turned on) sets $wgUseMediaWikiUIEverywhere to true?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205474 (owner: 10Jdlrobson)
[04:07:15] (03CR) 10Mattflaschen: "After it's done with Labs, I also support doing it in prod." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205474 (owner: 10Jdlrobson)
[04:15:19] (03CR) 10Glaisher: [C: 04-1] "Also set $wgVectorBetaFormRefresh = $wmgVectorBetaFormRefresh; in CommonSettings-labs.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205474 (owner: 10Jdlrobson)
[04:17:30] beta cluster is down
[04:17:36] (Cannot access the database: Can't connect to MySQL server on '10.68.16.193' (4) (10.68.16.193))
[04:48:45] PROBLEM - puppet last run on cp3037 is CRITICAL puppet fail
[04:56:49] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1230312 (10Ijon) Thanks, @Glaisher. Relayed.
[05:06:45] RECOVERY - puppet last run on cp3037 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[05:22:20] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Apr 23 05:21:17 UTC 2015 (duration 21m 16s)
[05:22:28] Logged the message, Master
[05:31:50] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1230328 (10Ijon) Done. :) https://translatewiki.net/w/i.php?language=gom-deva&module=namespace&title=Special%3AAdvancedTranslate And SUL seems {{done}} too. :) All systems go?
[05:46:35] PROBLEM - nova-compute process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:48:06] RECOVERY - nova-compute process on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[06:24:36] PROBLEM - nova-compute process on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:24:36] PROBLEM - salt-minion processes on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:26:05] RECOVERY - salt-minion processes on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:26:05] RECOVERY - nova-compute process on labvirt1005 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[06:30:24] PROBLEM - puppet last run on mw2023 is CRITICAL Puppet has 1 failures
[06:30:45] PROBLEM - puppet last run on mw1060 is CRITICAL Puppet has 1 failures
[06:31:15] PROBLEM - puppet last run on mc2011 is CRITICAL Puppet has 1 failures
[06:31:25] PROBLEM - puppet last run on cp3037 is CRITICAL Puppet has 2 failures
[06:32:04] PROBLEM - puppet last run on cp4014 is CRITICAL Puppet has 1 failures
[06:32:05] PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 1 failures
[06:32:15] PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 1 failures
[06:34:14] PROBLEM - puppet last run on mw1166 is CRITICAL Puppet has 1 failures
[06:34:25] PROBLEM - puppet last run on mw1144 is CRITICAL Puppet has 1 failures
[06:34:35] PROBLEM - dhclient process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:35:04] PROBLEM - puppet last run on mw1114 is CRITICAL Puppet has 1 failures
[06:35:26] PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 1 failures
[06:35:36] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures
[06:35:36] PROBLEM - puppet last run on mw1025 is CRITICAL Puppet has 2 failures
[06:35:45] PROBLEM - puppet last run on mw2127 is CRITICAL Puppet has 1 failures
[06:35:54] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures
[06:36:05] PROBLEM - puppet last run on mw2143 is CRITICAL Puppet has 1 failures
[06:36:05] RECOVERY - dhclient process on labvirt1006 is OK: PROCS OK: 0 processes with command name dhclient
[06:36:24] PROBLEM - puppet last run on mw2003 is CRITICAL Puppet has 1 failures
[06:45:24] RECOVERY - puppet last run on mw1060 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:46:04] RECOVERY - puppet last run on mc2011 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:46:05] RECOVERY - puppet last run on cp3037 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:46:45] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:46:54] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:46:56] RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:56] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:05] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:47:06] RECOVERY - puppet last run on mw1025 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:14] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:47:15] RECOVERY - puppet last run on mw1166 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:47:24] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:26] RECOVERY - puppet last run on mw1144 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:47:34] RECOVERY - puppet last run on mw2143 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:55] RECOVERY - puppet last run on mw2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:05] RECOVERY - puppet last run on mw1114 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:15] RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:50:31] (03CR) 10Nemo bis: "Dereckson, when you retry 10 times, does it really take only 5 (or 3) seconds per captcha? If so, ok." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) (owner: 10Nemo bis)
[06:57:57] (03PS3) 10Gilles: Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956)
[07:31:04] (03PS1) 10Mjbmr: Add abusefilter-modify-restricted right to sysop user group for idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206080 (https://phabricator.wikimedia.org/T96542)
[07:46:46] PROBLEM - salt-minion processes on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:48:15] RECOVERY - salt-minion processes on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:54:45] PROBLEM - puppet last run on mw2159 is CRITICAL puppet fail
[08:06:33] (03PS1) 10Yuvipanda: tools: Generate packages.json for cdnjs [puppet] - 10https://gerrit.wikimedia.org/r/206082
[08:06:55] (03PS2) 10Yuvipanda: tools: Generate packages.json for cdnjs [puppet] - 10https://gerrit.wikimedia.org/r/206082 (https://phabricator.wikimedia.org/T96799)
[08:07:11] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Generate packages.json for cdnjs [puppet] - 10https://gerrit.wikimedia.org/r/206082 (https://phabricator.wikimedia.org/T96799) (owner: 10Yuvipanda)
[08:14:25] RECOVERY - puppet last run on mw2159 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:15:29] hi
[08:15:30] https://commons.wikimedia.org/wiki/File:Rhode_Island_State_House,_90_Smith_Street,_Providence,_Providence_County,_RI_HABS_RI,4-PROV,180-8.tif
[08:15:38] file disappeared
[08:16:01] idem here https://commons.wikimedia.org/wiki/File:Rhode_Island_State_House,_90_Smith_Street,_Providence,_Providence_County,_RI_HABS_RI,4-PROV,180-9.tif
[08:16:14] can it be recovered?
[08:19:41] nevermind, reuploaded
[08:23:33] (03PS1) 10Yuvipanda: tools: Redirect /cdnjs from tools-static to tools [puppet] - 10https://gerrit.wikimedia.org/r/206084 (https://phabricator.wikimedia.org/T96799)
[08:23:55] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Redirect /cdnjs from tools-static to tools [puppet] - 10https://gerrit.wikimedia.org/r/206084 (https://phabricator.wikimedia.org/T96799) (owner: 10Yuvipanda)
[08:30:15] (03PS10) 10Filippo Giunchedi: graphite: introduce carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/181080 (https://phabricator.wikimedia.org/T85908)
[08:30:22] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: introduce carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/181080 (https://phabricator.wikimedia.org/T85908) (owner: 10Filippo Giunchedi)
[08:31:19] !log restart carbon on labmon1001, replace with carbon-c-relay
[08:31:25] Logged the message, Master
[08:38:21] (03CR) 10Mjbmr: "I apologizes for my bad attitude. Please get this ready as the community is waiting for this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204473 (https://phabricator.wikimedia.org/T93339) (owner: 10Mjbmr)
[08:42:58] (03CR) 10Mjbmr: "The link I provided in previous comments is in the archive and the discussion was closed 15th March, and I know their language the discuss" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204467 (https://phabricator.wikimedia.org/T92760) (owner: 10Mjbmr)
[08:54:14] PROBLEM - nutcracker port on silver is CRITICAL - Socket timeout after 2 seconds
[08:55:34] PROBLEM - puppet last run on cp3049 is CRITICAL puppet fail
[08:55:45] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[08:59:39] 6operations, 10RESTBase, 10Traffic, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1230755 (10akosiaris) >>! In T95229#1228222, @GWicke wrote: > The problem with 'content' is that it is not general enough, as this API will expose...
[09:09:26] PROBLEM - dhclient process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:10:55] RECOVERY - dhclient process on labvirt1006 is OK: PROCS OK: 0 processes with command name dhclient
[09:13:25] RECOVERY - puppet last run on cp3049 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:15:49] !log restart carbon on graphite1001, replace with carbon-c-relay
[09:15:55] Logged the message, Master
[09:16:05] PROBLEM - puppet last run on mw1001 is CRITICAL puppet fail
[09:20:32] (03PS1) 10KartikMistry: Beta: Enable Content Translation for Czech (cs) [puppet] - 10https://gerrit.wikimedia.org/r/206090 (https://phabricator.wikimedia.org/T96486)
[09:27:37] (03CR) 10Merlijn van Deen: "OK, makes sense. I have to figure out puppet a bit further, because I /really/ have no idea about the execution order. I would think that " [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen)
[09:29:35] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/185321 (owner: 10Dzahn)
[09:32:45] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[09:32:45] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[09:33:45] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[09:34:05] RECOVERY - puppet last run on mw1001 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures
[09:34:14] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[09:34:15] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[09:34:15] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[09:37:25] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0]
[09:37:34] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0]
[09:37:35] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0]
[09:37:35] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0]
[09:38:44] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0]
[09:39:05] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0]
[09:40:12] (lots of UNKNOWNs in icinga is me, should be recovering now)
[09:44:47] <_joe_> ulsfo troubles?
[09:44:52] <_joe_> oh ok
[10:00:57] _joe_, akosiaris: should we s/include role::/role / in https://gerrit.wikimedia.org/r/#/c/205350/3/manifests/role/sca.pp ?
[10:01:20] <_joe_> mobrovac: nope
[10:01:29] <_joe_> role 'foo' only goes at the node scope
[10:01:36] ah ok
[10:01:41] good to know
[10:01:43] thnx _joe_
[10:02:58] akosiaris: regarding splitting the graphoid patch in three, ok if i first change stuff based on your comments, you do another round of checks and then we split it?
[10:03:07] or you want them split asap/
[10:03:09] ?
[10:04:33] (03PS1) 10Mjbmr: Enable assigning 'accountcreator' for newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206093 (https://phabricator.wikimedia.org/T96824)
[10:06:05] PROBLEM - nova-compute process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:07:35] PROBLEM - salt-minion processes on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
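The distinction _joe_ draws above ("role 'foo' only goes at the node scope") can be sketched roughly as follows. This is an illustration only: the node and role names are hypothetical, and the exact syntax of the `role` parser function in operations/puppet may differ from this sketch:

```puppet
# Illustrative sketch, not actual operations/puppet code.
# At node scope, the 'role' keyword both includes role::sca and enables
# hiera lookups from role/common/sca.yaml for this node:
node 'sca1001.eqiad.wmnet' {
    role sca
}

# Inside another role class, the keyword must not be used; a plain
# include is the way to pull in a second role:
class role::sca {
    include role::zotero
}
```

This also matches mutante's observation earlier in the log that `role/common/package/builder.yaml` is only expected to apply when the role is declared via the keyword at node scope.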
[10:07:35] RECOVERY - nova-compute process on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[10:09:06] (03PS2) 10Mjbmr: Enable assigning 'accountcreator' for newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206093 (https://phabricator.wikimedia.org/T96824)
[10:09:15] RECOVERY - salt-minion processes on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:09:37] (03CR) 10Mobrovac: WIP: Graphoid: Puppet bits (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/205350 (https://phabricator.wikimedia.org/T90487) (owner: 10Mobrovac)
[10:09:48] (03PS4) 10Mobrovac: WIP: Graphoid: Puppet bits [puppet] - 10https://gerrit.wikimedia.org/r/205350 (https://phabricator.wikimedia.org/T90487)
[10:10:53] (03CR) 10Mobrovac: "@akosiaris, addressed your comments modulo patch-split, will do that next" [puppet] - 10https://gerrit.wikimedia.org/r/205350 (https://phabricator.wikimedia.org/T90487) (owner: 10Mobrovac)
[10:11:03] <_joe_> mobrovac: amend this patch first
[10:11:16] _joe_: yup, did so ^^
[10:11:23] <_joe_> having "everything needed for a new service" in one patch will help a lot our work
[10:11:28] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1230839 (10Aklapper)
[10:12:05] _joe_: should I then abandon this patch so that we keep it for historical reasons and create completely new patches for the split ?
[10:12:11] (to make it easier)
[10:12:21] <_joe_> mobrovac: maybe, something like that :)
[10:12:29] k
[10:23:26] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[10:24:25] PROBLEM - nova-compute process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:25:54] RECOVERY - nova-compute process on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[10:26:56] 6operations: Encrypted password storage - https://phabricator.wikimedia.org/T96130#1230867 (10Joe) >>! In T96130#1224524, @MoritzMuehlenhoff wrote: > > (Note that I could only test on Linux, wrt MacOS I can only make an informed guess that both will work fine. Real world testing of > pws on a Mac would be very...
[10:28:25] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0]
[10:36:57] (03Abandoned) 10Mobrovac: WIP: Graphoid: Puppet bits [puppet] - 10https://gerrit.wikimedia.org/r/205350 (https://phabricator.wikimedia.org/T90487) (owner: 10Mobrovac)
[10:41:34] PROBLEM - nova-compute process on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:44:45] RECOVERY - nova-compute process on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[10:49:44] PROBLEM - salt-minion processes on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:51:16] RECOVERY - salt-minion processes on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:04:45] PROBLEM - nova-compute process on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:06:15] PROBLEM - salt-minion processes on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:07:54] RECOVERY - salt-minion processes on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:07:54] RECOVERY - nova-compute process on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[11:09:03] springle: do we leave innodb_lock_wait_timeout to the default?
[11:21:08] PROBLEM - dhclient process on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:21:14] PROBLEM - salt-minion processes on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:22:35] RECOVERY - dhclient process on labvirt1005 is OK: PROCS OK: 0 processes with command name dhclient
[11:22:54] RECOVERY - salt-minion processes on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:23:25] PROBLEM - dhclient process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:24:54] RECOVERY - dhclient process on labvirt1006 is OK: PROCS OK: 0 processes with command name dhclient
[11:36:06] PROBLEM - nova-compute process on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:37:54] RECOVERY - nova-compute process on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[11:45:34] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail
[12:02:06] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures
[12:16:12] (03CR) 10Alexandros Kosiaris: [C: 032] Beta: Enable Content Translation for Czech (cs) [puppet] - 10https://gerrit.wikimedia.org/r/206090 (https://phabricator.wikimedia.org/T96486) (owner: 10KartikMistry)
[12:29:29] !log investigating icinga UNKNOWN for hhvm queue/threads
[12:29:35] Logged the message, Master
[12:34:15] PROBLEM - salt-minion processes on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:35:45] RECOVERY - salt-minion processes on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[12:40:24] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1231011 (10Nemo_bis)
[12:43:15] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1231017 (10faidon) So, assuming servers with single-instance Cassandra nodes, how does the capacity/procurement planning looks like for a) short-term (this FY), b) long-term (next FY, up...
[12:43:33] (03PS1) 10Mobrovac: Graphoid: service deployment on SCA [puppet] - 10https://gerrit.wikimedia.org/r/206105 (https://phabricator.wikimedia.org/T90487)
[12:45:48] (03PS2) 10Mobrovac: Graphoid: service deployment on SCA [puppet] - 10https://gerrit.wikimedia.org/r/206105 (https://phabricator.wikimedia.org/T90487)
[12:46:46] (03PS4) 10Cmjohnson: Adding dhcpd entries for new logstash1004-6 (T96692) [puppet] - 10https://gerrit.wikimedia.org/r/205901
[12:46:58] (03PS3) 10Mobrovac: Graphoid: service deployment on SCA [puppet] - 10https://gerrit.wikimedia.org/r/206105 (https://phabricator.wikimedia.org/T90487)
[12:56:04] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[12:56:05] PROBLEM - dhclient process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:56:35] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [12:56:35] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [12:56:35] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [12:56:35] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [12:58:34] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1231046 (10Glaisher) Should hi be the fallback language for gom-deva? [12:59:35] RECOVERY - dhclient process on labvirt1006 is OK: PROCS OK: 0 processes with command name dhclient [12:59:40] (03PS1) 10Mobrovac: Graphoid: LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/206106 (https://phabricator.wikimedia.org/T90487) [13:01:15] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [13:01:45] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [13:01:45] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [13:01:45] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [13:01:45] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [13:02:45] PROBLEM - nova-compute process on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
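The "11.11% of data above the critical threshold" wording in these Varnishkafka alerts comes from a check that computes what fraction of recent Graphite datapoints exceed a threshold. A sketch of that computation (the actual check script and its sampling window aren't shown in this log; the sample series below is invented, sized so one bad sample out of nine yields the 11.11% seen above):

```python
def percent_above(datapoints, threshold):
    # Share of non-null datapoints strictly above the threshold, as a
    # percentage rounded to two decimals like the alert text.
    vals = [v for v in datapoints if v is not None]
    if not vals:
        return 0.0
    return round(100.0 * sum(1 for v in vals if v > threshold) / len(vals), 2)

series = [1200, 900, None, 1500, 800, 1100, 950, 1000, 700, 25000]
print(percent_above(series, 20000.0))  # → 11.11 (1 of 9 valid samples)
```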
[13:05:53] (03PS1) 10Mobrovac: Graphoid: Varnish configuration [puppet] - 10https://gerrit.wikimedia.org/r/206108 (https://phabricator.wikimedia.org/T90487) [13:05:54] RECOVERY - nova-compute process on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [13:07:35] akosiaris: https://gerrit.wikimedia.org/r/#/c/206105/ and deps ready [13:07:37] enjoy :) [13:09:30] (03PS1) 10Filippo Giunchedi: graphite: write local metrics using consistent hashing [puppet] - 10https://gerrit.wikimedia.org/r/206109 [13:10:15] mobrovac: can you include me too in those reviews? mostly so they show up in inbox [13:10:24] yup [13:10:40] thanks! [13:10:57] godog: done [13:11:20] \o/ [13:12:00] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: write local metrics using consistent hashing [puppet] - 10https://gerrit.wikimedia.org/r/206109 (owner: 10Filippo Giunchedi) [13:15:17] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1231061 (10Aklapper) [13:17:05] PROBLEM - statsite backend instances on graphite1001 is CRITICAL Not all configured statsite instances are running. [13:17:26] PROBLEM - dhclient process on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:20:36] RECOVERY - statsite backend instances on graphite1001 is OK All defined statsite jobs are runnning. [13:20:45] RECOVERY - dhclient process on labvirt1005 is OK: PROCS OK: 0 processes with command name dhclient [13:22:05] PROBLEM - puppet last run on cp3020 is CRITICAL puppet fail [13:22:08] 6operations, 10RESTBase, 10Traffic, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1231079 (10Anomie) >>! In T95229#1230755, @akosiaris wrote: > TL;DR: `/api/_/v1/` Eeew. > Thoughts ? I have serious doubts about encouraging thi... 
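Per the change merged above, local Graphite metrics are now routed to carbon instances with consistent hashing. A toy ring illustrating the general idea (this is not carbon-c-relay's actual hash function, and the instance names are invented for illustration):

```python
import bisect
import hashlib

class HashRing:
    # Minimal consistent-hash ring: each destination gets many virtual
    # points on the ring, and a metric goes to the first point at or
    # after its own hash, wrapping around at the end.
    def __init__(self, nodes, replicas=64):
        self._points = []
        for node in nodes:
            for i in range(replicas):
                h = self._hash("%s#%d" % (node, i))
                bisect.insort(self._points, (h, node))

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode("utf-8")).hexdigest()[:8], 16)

    def node_for(self, metric):
        h = self._hash(metric)
        idx = bisect.bisect_left(self._points, (h, ""))
        if idx == len(self._points):
            idx = 0  # wrap around the ring
        return self._points[idx][1]

ring = HashRing(["carbon-a", "carbon-b", "carbon-c"])
```

The point of the scheme is that a given metric name always hashes to the same carbon instance, so each whisper file is written by exactly one writer.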
[13:25:56] PROBLEM - dhclient process on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:24] (03PS1) 10Mobrovac: service::node: Deps fix + deployment-prep hieradata [puppet] - 10https://gerrit.wikimedia.org/r/206115 [13:28:27] 6operations, 10ops-eqiad, 10Wikimedia-Logstash, 5Patch-For-Review: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1231095 (10Cmjohnson) The setup is complete and can be installed at any time. They are set to install jessie. Please let me know if you want to change to precis... [13:29:04] 6operations, 5Patch-For-Review: adjust CirrusSearch monitoring - https://phabricator.wikimedia.org/T84163#1231101 (10fgiunchedi) 5Resolved>3Open reopening, the warning still comes up in icinga because we show both SOFT and HARD states, not sure how to best fix that. ideally only hard states would be shown [13:29:06] RECOVERY - dhclient process on labvirt1005 is OK: PROCS OK: 0 processes with command name dhclient [13:29:32] _joe_: minor service::node improvements here - https://gerrit.wikimedia.org/r/#/c/206115/ [13:32:18] 6operations, 10Traffic: Refactor varnish puppet config - https://phabricator.wikimedia.org/T96847#1231116 (10BBlack) Saved for posterity, as it's relevant to text/mobile refactors for vcl_recv/vcl_deliver later, although I may partially address this in a cache-cleanup branch commit today in order to unblock fu... [13:35:07] (03PS1) 10coren: Labs: make reboot-if-idmap smarter [puppet] - 10https://gerrit.wikimedia.org/r/206116 [13:35:16] andrewbogott: ^^ [13:36:15] PROBLEM - salt-minion processes on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:36:15] PROBLEM - nova-compute process on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:37:54] RECOVERY - salt-minion processes on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:37:54] RECOVERY - nova-compute process on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [13:38:01] (03CR) 10Andrew Bogott: [C: 031] Labs: make reboot-if-idmap smarter [puppet] - 10https://gerrit.wikimedia.org/r/206116 (owner: 10coren) [13:38:45] RECOVERY - puppet last run on cp3020 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [13:41:32] Coren: seems like something may have gone awry with the deployment-prep update. Seems like it's throwing lots of nslcd "error writing to client: Broken pipe" while looking for &(objectClass=posixGroup)(gidNumber=50062) which seems to be the project-bastion group [13:42:23] also, generally, shinken is blowing up in -releng [13:42:34] thcipriani|afk: That doesn't seem related to NFS at all - that sounds like LDAP woes [13:42:46] PROBLEM - OCG health on ocg1002 is CRITICAL ocg_job_status 738315 msg: ocg_render_job_queue 7064 msg (>=3000 critical) [13:43:14] PROBLEM - OCG health on ocg1001 is CRITICAL ocg_job_status 739320 msg: ocg_render_job_queue 7514 msg (>=3000 critical) [13:43:14] PROBLEM - OCG health on ocg1003 is CRITICAL ocg_job_status 739320 msg: ocg_render_job_queue 7514 msg (>=3000 critical) [13:45:32] hmm, this also didn't get started until I called it a day last night. I've tried stopping and restarting nslcd, reran it in debug mode to find the faulty gidNumber. [13:46:32] I'll check in -labs, thanks [13:48:18] (03CR) 10coren: [C: 032] Labs: make reboot-if-idmap smarter [puppet] - 10https://gerrit.wikimedia.org/r/206116 (owner: 10coren) [13:49:17] mmhh ocg machines aren't happy, load is through the roof [13:49:38] many xelatex processes, in addition to ocg of course [13:49:55] PROBLEM - salt-minion processes on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
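The nslcd debugging above turns on a concrete LDAP filter. To replay the same lookup outside nslcd (e.g. with `ldapsearch`), the filter can be rebuilt from the gid; a small sketch:

```python
def posix_group_filter(gid):
    # RFC 4515 AND-filter for a group lookup by numeric gid, matching
    # the filter nslcd logged while resolving gidNumber=50062 above.
    return "(&(objectClass=posixGroup)(gidNumber=%d))" % gid

print(posix_group_filter(50062))
# → (&(objectClass=posixGroup)(gidNumber=50062))
```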
[13:49:55] PROBLEM - nova-compute process on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:50:31] <_joe_> what's up with ocg? [13:50:58] <_joe_> godog: that's an old classic, it wasn't happening since some time though [13:51:25] _joe_: ah, is that benign? [13:52:27] <_joe_> nope [13:52:45] <_joe_> not benign, but usually a restart of the service is enough [13:52:53] <_joe_> but lemme take a look [13:52:57] 6operations, 10Traffic: Package/backport openssl 1.0.2 + nginx 1.7.x or higher - https://phabricator.wikimedia.org/T96850#1227475 (10BBlack) Waiting for the (relatively-imminent) official Jessie release before making any further decisions here... [13:53:52] <_joe_> godog: something strange happened around 13:40 UTC though [13:54:23] <_joe_> someone requested a ton of pdfs, it would seem [13:54:45] RECOVERY - salt-minion processes on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:54:45] RECOVERY - nova-compute process on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [13:55:53] indeed, the queue has many items [13:56:19] <_joe_> are you inspecting it already? [13:57:06] not yet no, looking up how to do that [13:57:20] cscott: ^ [13:57:56] <_joe_> godog: I guess looking at the ocg code could be a good idea, or maybe, lemme take a look [14:01:15] RECOVERY - OCG health on ocg1002 is OK ocg_job_status 745803 msg: ocg_render_job_queue 387 msg [14:01:45] RECOVERY - OCG health on ocg1001 is OK ocg_job_status 745856 msg: ocg_render_job_queue 55 msg [14:01:45] RECOVERY - OCG health on ocg1003 is OK ocg_job_status 745856 msg: ocg_render_job_queue 54 msg [14:01:54] godog, _joe_: i'm here. [14:02:03] can you recap? [14:03:45] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[14:04:19] cscott: hi, we got icinga alerts for ocg queue, looks like it is clearing it now but we'd like to understand where it came from [14:04:33] well yeah the queue is clear [14:04:43] did you clear it manually, or did it resolve itself? [14:04:54] it chugged through the queue by itself [14:05:15] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [14:05:35] yeah, i do see a 20 minute long spike in https://ganglia.wikimedia.org/latest/?c=PDF%20servers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [14:06:06] corresponds to a spike in incoming network, so probably triggered by an outside user, not some internal problem? [14:06:23] i can look through the logs to see what it was they wanted so badly [14:07:06] yep that'd be great! [14:07:32] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor comments inline. Btw, the next debian version following jessie might not have a nodejs-legacy package any longer. But it is a bridge" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/206115 (owner: 10Mobrovac) [14:07:33] looks like the job flood was only 5 minutes long, it then took 15 minutes to work through it. [14:08:34] it's quite possible the icinga warning levels are still a bit too touchy, this probably wasn't really a "critical". [14:08:35] <_joe_> cscott: I tried to peek at the queue, but by the time I was ready to do it, it was gone [14:08:56] <_joe_> cscott: it was, cpu to 80% and a growing queue [14:09:08] <_joe_> so I'd say they're pretty well tuned [14:09:35] hm, but i'd say a "CRITICAL" notification should be one which brings down the service [14:10:30] but yeah, hard to nail it down further without looking at the shape of the graph [14:11:21] anyway, logstash has the job details, looking through that now.
[14:12:44] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [14:14:26] (03PS1) 10Merlijn van Deen: Extend Exim diamond collector for Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/206118 [14:15:10] (03CR) 10jenkins-bot: [V: 04-1] Extend Exim diamond collector for Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/206118 (owner: 10Merlijn van Deen) [14:16:38] _joe_: looks like someone tried to download a bunch of wikiquote as PDFs [14:17:17] cscott: English Wikiquote? [14:17:28] Nemo_bis: yes [14:17:32] cscott: time_to_success/failure is in milliseconds, is it? [14:17:41] It is the only language edition full of images [14:17:55] yeah, that could slow things down [14:18:02] godog: in graphite? i think so. [14:18:19] cscott: yeah graphite [14:18:35] https://logstash.wikimedia.org/#dashboard/temp/zJ__tW5TTEGuSHujztAiBg has the requests for the period in question [14:18:45] well, the first part of them, at least. [14:20:43] (03PS2) 10Merlijn van Deen: Extend Exim diamond collector for Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/206118 [14:21:32] (03CR) 10jenkins-bot: [V: 04-1] Extend Exim diamond collector for Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/206118 (owner: 10Merlijn van Deen) [14:21:42] (03PS3) 10Merlijn van Deen: Extend Exim diamond collector for Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/206118 [14:22:24] (03CR) 10jenkins-bot: [V: 04-1] Extend Exim diamond collector for Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/206118 (owner: 10Merlijn van Deen) [14:23:47] (03PS4) 10Merlijn van Deen: Extend Exim diamond collector for Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/206118 [14:23:55] cscott: but yeah it'd be nice if there was an easy way to tell from the logs where a job came from [14:24:20] godog: yeah, this has been a thorn in my side for a while: [14:25:07] the problem is that Extension:Collection lets you make custom collections for rendering, as
well as rendering "saved books" or "just one article" [14:25:25] both of the latter have corresponding URLs that I would like to log, to tell where the problem came from [14:25:50] but because of the way the extension is written, we've lost the source information by the time we invoke OCG [14:25:52] (03CR) 10Mobrovac: "wrt nodejs-legacy, highly possible. We should switch our shebangs to /usr/bin/nodejs by then." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/206115 (owner: 10Mobrovac) [14:26:15] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [14:26:16] mobrovac: I 've updated https://www.mediawiki.org/wiki/Services/Meetings/2015-03-19-Ops a bit with some proposals on how this should move forward. I 'll start turning them into phab tickets and working on them. [14:26:49] akosiaris: nice, thnx! I put the updates today about service::node and the puppet bits :P [14:26:59] cscott: ah, is it tracked anywhere in phabricator? [14:27:41] godog: not yet. i should file a ticket. and i maintain Extension:Collection now (de facto, i'm the only person who has committed to it in months), so I suppose I could/should fix the problem at the source. [14:28:36] hehe that's easier for sure [14:30:07] so, seems to be an alphabetical download of all of wikiquote? [[Doubt]], followed by [[Duty]], etc. The initiator seems to have realized what they were doing and stopped it before it got past D. 
[14:30:18] 6operations, 10Traffic: integrate (pybal|varnish)->varnish backend config/state with etcd or similar - https://phabricator.wikimedia.org/T97029#1231228 (10BBlack) 3NEW [14:30:24] <_joe_> cscott: ahah, nice [14:30:39] 6operations, 10Traffic: integrate (pybal|varnish)->varnish backend config/state with etcd or similar - https://phabricator.wikimedia.org/T97029#1231235 (10BBlack) [14:30:41] 6operations: Choose a consistent, distributed k/v storage for configuration management/discovery - https://phabricator.wikimedia.org/T95656#1231236 (10BBlack) [14:30:56] 6operations, 10RESTBase, 10Traffic, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1231237 (10Tgr) How about mapping `/api/foo/v1` to `http://rest.wikimedia.org/{domain}/v1/foo` (for some whitelist or non-blacklist)? Right now, RE... [14:32:34] 6operations, 10Traffic: Reboot caches for kernel 3.19.3 globally - https://phabricator.wikimedia.org/T96854#1231243 (10BBlack) p:5High>3Low [14:33:23] you can scroll through the requests at https://logstash.wikimedia.org/#dashboard/temp/2qdQEJXlSvSEPZI9whuP0w [14:33:24] (03CR) 10Alexandros Kosiaris: service::node: Deps fix + deployment-prep hieradata (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/206115 (owner: 10Mobrovac) [14:33:52] 6operations, 10Traffic: Reboot caches for kernel 3.19.3 globally - https://phabricator.wikimedia.org/T96854#1231245 (10BBlack) p:5Low>3High [14:34:06] but you have to expand each "picking up job" message to look at the job.metabook field to see what they were attempting to render. like i said, more awkward than it should be. 
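As cscott notes, the interesting data sits in the `job.metabook` field behind each "picking up job" message. A hedged sketch of an Elasticsearch query body that would pull just that field (the field names come from the discussion above; the query shape and `@timestamp` field are assumptions about the Logstash index of that era, not a confirmed schema):

```python
def ocg_metabook_query(since="now-1h"):
    # Select only job.metabook from OCG "picking up job" events newer
    # than `since`. Field names assumed from the IRC discussion.
    return {
        "query": {
            "bool": {
                "must": [
                    {"match_phrase": {"message": "picking up job"}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
        "fields": ["job.metabook"],
    }
```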
[14:35:11] looks like beta labs is teetering [14:36:15] (03CR) 10Manybubbles: logstash: Remove redis input (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/205968 (owner: 10BryanDavis) [14:36:38] cscott: thanks, yeah having that on the ocg page in wikitech would help, at least knowing where to look [14:37:33] godog: created https://phabricator.wikimedia.org/T97030, thanks for the push. [14:37:57] (03CR) 10Manybubbles: [C: 031] logstash: Provision Elasticsearch only backend hosts [puppet] - 10https://gerrit.wikimedia.org/r/205970 (https://phabricator.wikimedia.org/T96814) (owner: 10BryanDavis) [14:38:01] godog: just created https://logstash.wikimedia.org/#/dashboard/elasticsearch/OCG%20Backend%20requests and I'll make sure it ends up on the wiki page. [14:38:18] cscott: haha no problem, thanks! [14:39:47] (03CR) 10Manybubbles: [C: 031] "Fine by me. Do these machines have half decent disks in them worth maybe keeping older data on? We use allocation filtering for that if we" [puppet] - 10https://gerrit.wikimedia.org/r/205971 (https://phabricator.wikimedia.org/T96814) (owner: 10BryanDavis) [14:43:59] who's managing memberships in Phabricator? (I need to access a Phab ticket related to PCI DSS compliance on fund tech) [14:44:30] (03CR) 10Mobrovac: service::node: Deps fix + deployment-prep hieradata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/206115 (owner: 10Mobrovac) [14:44:40] (03PS2) 10Mobrovac: service::node: Deps fix + deployment-prep hieradata [puppet] - 10https://gerrit.wikimedia.org/r/206115 [14:45:44] PROBLEM - dhclient process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:47:24] RECOVERY - dhclient process on labvirt1006 is OK: PROCS OK: 0 processes with command name dhclient [14:48:15] (03CR) 10BryanDavis: logstash: Remove redis input (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/205968 (owner: 10BryanDavis) [14:51:55] (03CR) 10BryanDavis: "> Do these machines have half decent disks in them worth maybe keeping older data on?" [puppet] - 10https://gerrit.wikimedia.org/r/205971 (https://phabricator.wikimedia.org/T96814) (owner: 10BryanDavis) [14:52:25] (03CR) 10Manybubbles: [C: 031] logstash: Remove redis input (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/205968 (owner: 10BryanDavis) [14:53:03] moritzm: andre__ would be the best person to talk to [14:53:20] (03CR) 10Manybubbles: "> Their disks aren't bad, but the amount of ram they have is puny which leads to lots of OOM on query problems today. Our total Logstash r" [puppet] - 10https://gerrit.wikimedia.org/r/205971 (https://phabricator.wikimedia.org/T96814) (owner: 10BryanDavis) [14:53:20] eh? [14:54:08] moritzm: Yo! Ticket ID? And do you swear by your life that you've signed an NDA? :) [14:54:35] andre__: moritz is already in the nda group [14:54:38] fundraising does their own thing [14:54:48] what's the ticket /3? [14:54:50] #? [14:54:51] andre__: https://phabricator.wikimedia.org/T92579 [14:55:03] not even sure if I can help, let's see [14:55:11] the group is "WMF-NDA" [14:55:23] no, "WMF FR" [14:55:28] yeah, heh, well. I have no access either (though being admin). That's how Phab works :-/ [14:55:47] * andre__ wondering whether to make himself a member [14:55:48] moritzm: contact @atgo https://phabricator.wikimedia.org/p/atgo/ [14:55:56] yeah, contact Anne [14:55:56] chasemp: thanks, will do! 
[14:55:58] fundraising has financial things and such and phab admins are not all powerful [14:56:02] more like janitorial [14:57:54] (03CR) 10Manybubbles: [C: 031] Re-enable Special:SupportedLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204032 (https://phabricator.wikimedia.org/T54728) (owner: 10Nikerabbit) [14:59:56] PROBLEM - dhclient process on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150423T1500). Please do the needful. [15:00:20] anyone doing swat today/ [15:00:23] I can do it [15:00:26] its a light day :) [15:00:30] manybubbles: Go for it [15:00:34] anomie: cools [15:00:41] Nikerabbit: hi - swat time for you [15:00:50] manybubbles: I'm the only one? [15:01:01] Nikerabbit: yeah [15:01:03] ready? [15:01:12] manybubbles: super excited [15:01:17] (03CR) 10Manybubbles: [C: 032] Re-enable Special:SupportedLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204032 (https://phabricator.wikimedia.org/T54728) (owner: 10Nikerabbit) [15:01:19] oh, even better! [15:01:25] (03Merged) 10jenkins-bot: Re-enable Special:SupportedLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204032 (https://phabricator.wikimedia.org/T54728) (owner: 10Nikerabbit) [15:03:02] !log manybubbles Synchronized wmf-config/CommonSettings.php: swat: Re-enable Special:SupportedLanguages (duration: 00m 11s) [15:03:09] Logged the message, Master [15:03:15] RECOVERY - dhclient process on labvirt1005 is OK: PROCS OK: 0 processes with command name dhclient [15:03:21] Nikerabbit: ^^^ [15:03:57] manybubbles: site still up [15:04:10] so far as I can tell [15:04:37] Nikerabbit: I did it on commons and it worked [15:05:47] At last... \o/ these years without SupportedLanguages were painful for the translation process. 
Lots of quality degradation [15:06:42] manybubbles: I also got "the servers are overloaded" as expected while opening all languages in tabs simultaneously [15:06:49] Nemo_bis and Nikerabbit: well, swat's done. I'm glad I could have a small hand in fixing this important problem. thanks for all your hard work on it. [15:07:09] Nikerabbit: ah! yes. is that a pool counter thing? [15:07:44] manybubbles: yes it is [15:08:02] cool. is cached? I just opened half of them quickly and everything just worked fine [15:08:23] manybubbles: yes the results are cached (for 24 hours, though regenerated after 1h if accessed) [15:08:33] sweet [15:08:37] ok then! [15:08:41] * manybubbles is done with swat [15:08:47] party time!!! [15:09:00] except that I have to assist in CX deployment in 55 minutes ;) [15:10:03] Nikerabbit: nice [15:16:02] 6operations, 10RESTBase, 10Traffic, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1231287 (10GWicke) 5Open>3Resolved a:3GWicke Sorry for now updating the ticket yesterday. We decided to go with `/api/rest_v1/`: http://en.... [15:18:06] 6operations, 6Services: Define and then implement a way for a future service owner to provide the info required to have a new service brought into production - https://phabricator.wikimedia.org/T97031#1231293 (10akosiaris) 3NEW [15:21:33] I've got 3 puppet patches that need to be merged so that I can get the new logstash100[4-6] boxes running: https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:logstash,n,z [15:21:49] Which awesome root wants to help with this? 
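Nikerabbit describes the Special:SupportedLanguages cache as lasting 24 hours but being regenerated after 1 hour if accessed. A minimal sketch of that policy (the names and structure here are illustrative, not MediaWiki's actual implementation):

```python
TTL = 24 * 3600      # hard expiry: entries older than this are useless
REGEN_AFTER = 3600   # an access after this age triggers a recompute

def get_supported_languages(cache, compute, now):
    # Serve the cached stats while they are under an hour old; between
    # 1h and 24h the next access regenerates them, as described above.
    entry = cache.get("supported-languages")
    if entry is not None and now - entry["ts"] < TTL:
        if now - entry["ts"] < REGEN_AFTER:
            return entry["value"]  # fresh enough: serve as cached
        # stale but within TTL: fall through and regenerate on access
    value = compute()
    cache["supported-languages"] = {"ts": now, "value": value}
    return value
```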
[15:22:59] (03CR) 10Chad: "Run it through puppet compiler and check :)" [puppet] - 10https://gerrit.wikimedia.org/r/205969 (owner: 10BryanDavis) [15:23:11] The servers are racked and have a base install of jessie on them thanks to cmjohnson [15:25:08] (03PS1) 10Filippo Giunchedi: graphite: stop system carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/206127 [15:26:29] (03PS7) 10Chad: Hiera-ize the mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/204331 [15:26:37] (03CR) 10Chad: "Stupid jenkins" [puppet] - 10https://gerrit.wikimedia.org/r/204331 (owner: 10Chad) [15:28:18] (03PS1) 10Filippo Giunchedi: gdash: adjust graphite dashboards afte carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/206129 [15:28:54] PROBLEM - dhclient process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:30:15] RECOVERY - dhclient process on labvirt1006 is OK: PROCS OK: 0 processes with command name dhclient [15:30:29] (03PS2) 10Filippo Giunchedi: gdash: adjust graphite dashboards afte carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/206129 [15:30:37] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] gdash: adjust graphite dashboards afte carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/206129 (owner: 10Filippo Giunchedi) [15:31:06] Coren: good to merge your change on palladium? [15:34:29] godog: Yeah, sorry about that [15:34:40] git distracted [15:35:00] git: 'distracted' is not a git command. See 'git --help'. [15:35:05] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [15:35:25] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[15:38:49] 6operations, 10Continuous-Integration, 5Continuous-Integration-Isolation: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1231353 (10mark) a:5mark>3faidon [15:39:14] PROBLEM - nova-compute process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:39:15] PROBLEM - salt-minion processes on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:39:50] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1231358 (10Eevans) >>! In T93790#1231017, @faidon wrote: > So, assuming servers with single-instance Cassandra nodes, how does the capacity/procurement planning looks like for a) short-te... [15:40:55] RECOVERY - salt-minion processes on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:40:55] RECOVERY - nova-compute process on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [15:41:11] 6operations: Define and implement an automated process to ease the introduction of a new service into production - https://phabricator.wikimedia.org/T97036#1231364 (10akosiaris) 3NEW [15:47:45] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [15:52:05] (03PS1) 10Chad: Move beta's mediawiki-installation dsh group into hiera [puppet] - 10https://gerrit.wikimedia.org/r/206131 [15:53:25] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1231376 (10faidon) >>! In T93790#1231358, @Eevans wrote: > I assume we have some fairly standard hardware options to choose from, is that information available somewhere? It might be use... 
[15:54:21] 6operations: Define and implement an automated process to ease the introduction of a new service into production - https://phabricator.wikimedia.org/T97036#1231377 (10akosiaris) [15:55:05] PROBLEM - salt-minion processes on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:55:05] PROBLEM - nova-compute process on labvirt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:56:44] RECOVERY - salt-minion processes on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:56:44] RECOVERY - nova-compute process on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [15:57:21] (03CR) 10Tim Landscheidt: "I share your feelings :-). When I first thought about how to solve this issue, in my mind it was so easy: "class { 'exim4':" in toollabs," [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [15:58:36] (03PS1) 10Chad: dsh: remove template from scap-proxies and just use join() [puppet] - 10https://gerrit.wikimedia.org/r/206132 [15:59:07] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1231396 (10GWicke) Re requirements: Storage is hard to predict exactly (still quite a few variables that can influence it), but I think it's safe to say at this point that we'll at least... 
[15:59:25] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [15:59:33] 6operations, 7Service-Architecture: Create a nagios check script that can monitor multiple endpoints based on what the service exposes - https://phabricator.wikimedia.org/T94831#1231399 (10Joe) [15:59:35] 6operations: Define and implement an automated process to ease the introduction of a new service into production - https://phabricator.wikimedia.org/T97036#1231400 (10Joe) [16:00:04] kart_: Dear anthropoid, the time has come. Please deploy Content Translation deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150423T1600). [16:05:19] (03CR) 10Alexandros Kosiaris: [C: 032] service::node: Deps fix + deployment-prep hieradata [puppet] - 10https://gerrit.wikimedia.org/r/206115 (owner: 10Mobrovac) [16:14:04] 6operations, 10Wikimedia-Logstash, 10hardware-requests: eqiad: (3) servers for logstash service - https://phabricator.wikimedia.org/T84958#1231449 (10bd808) [16:14:05] 6operations, 10ops-eqiad, 10Wikimedia-Logstash, 5Patch-For-Review: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1231447 (10bd808) [16:14:11] 6operations, 10Incident-20150205-SiteOutage, 6MediaWiki-API-Team, 10MediaWiki-Debug-Logging, and 2 others: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1231446 (10bd808) [16:21:46] 6operations, 10Incident-20150205-SiteOutage, 6MediaWiki-API-Team, 10MediaWiki-Debug-Logging, and 2 others: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1231480 (10bd808) I have a new and improved version of the mediawiki-config patch ( Hm.. what is $ganglia_aggregator for and why do we apply to a subset of nodes only (typically the first and last in a range). Does it somehow figure out the other ones and send it to one of the two aggregators in the same range?
Looks very magical to me in puppet. [16:23:46] jouncebot: ok, sir! [16:25:34] PROBLEM - dhclient process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:40] (03CR) 10Krinkle: debug logging: Convert to Monolog logging (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis) [16:27:06] YuviPanda, https://gerrit.wikimedia.org/r/#/c/205881/ please :) [16:30:24] RECOVERY - dhclient process on labvirt1006 is OK: PROCS OK: 0 processes with command name dhclient [16:31:49] (03PS8) 10BryanDavis: debug logging: Convert to Monolog logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) [16:31:52] (03CR) 10jenkins-bot: [V: 04-1] debug logging: Convert to Monolog logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis) [16:32:14] (03CR) 10BryanDavis: debug logging: Convert to Monolog logging (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis) [16:33:23] !log rebooting labvirt1005 [16:33:27] Logged the message, Master [16:33:28] (03PS6) 10Tim Landscheidt: Tools: Simplify and fix mail setup [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [16:37:25] PROBLEM - Host labvirt1005 is DOWN: PING CRITICAL - Packet loss = 100% [16:38:38] (03PS7) 10Tim Landscheidt: Tools: Simplify and fix mail setup [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [16:38:49] (03PS3) 10Dzahn: Change BZ references to Phabricator tickets [puppet] - 10https://gerrit.wikimedia.org/r/204626 (https://phabricator.wikimedia.org/T96431) (owner: 10Alex Monk) [16:38:56] (03CR) 10jenkins-bot: [V: 04-1] Change BZ references to Phabricator 
tickets [puppet] - 10https://gerrit.wikimedia.org/r/204626 (https://phabricator.wikimedia.org/T96431) (owner: 10Alex Monk) [16:39:21] (03CR) 10Dzahn: "@godog added ticket reference :) T96431 (https://phabricator.wikimedia.org/T96431)" [puppet] - 10https://gerrit.wikimedia.org/r/204626 (https://phabricator.wikimedia.org/T96431) (owner: 10Alex Monk) [16:39:55] (03PS9) 10BryanDavis: debug logging: Convert to Monolog logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) [16:40:15] RECOVERY - Host labvirt1005 is UP: PING OK - Packet loss = 0%, RTA = 15.41 ms [16:40:52] (03PS1) 10Krinkle: contint: Set html_errors=0 in php/conf.d [puppet] - 10https://gerrit.wikimedia.org/r/206143 (https://phabricator.wikimedia.org/T97040) [16:42:16] (03CR) 10Tim Landscheidt: "In testing on Toolsbeta, this is a noop for mail clients, and for the mail relay there are only comment changes:" [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [16:45:10] (03CR) 10BryanDavis: "Patch set 8 addressed a bug that Timo spotted and one that I saw." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis) [16:45:45] PROBLEM - dhclient process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:47:24] RECOVERY - dhclient process on labvirt1006 is OK: PROCS OK: 0 processes with command name dhclient [16:52:37] (03PS1) 10Aaron Schulz: Changed production innodb_lock_wait_timeout from 50 => 30 [puppet] - 10https://gerrit.wikimedia.org/r/206145 [16:53:38] !log aaron Synchronized php-1.26wmf2/includes/jobqueue/JobRunner.php: d23777e6832f660984ce4445ab04f98b7ff0d25f (duration: 00m 12s) [16:53:41] Logged the message, Master [16:53:48] 6operations: install/setup/deploy db2043-db2070 - https://phabricator.wikimedia.org/T96383#1231605 (10Papaul) a:5Papaul>3RobH Installation complete on db2043 -db2070 [16:55:45] PROBLEM - dhclient process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:55:51] (03PS11) 10BBlack: Adding a Last-Access cookie to text and mobile requests [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T88813) (owner: 10Nuria) [16:57:24] RECOVERY - dhclient process on labvirt1006 is OK: PROCS OK: 0 processes with command name dhclient [17:03:14] !log kartik Started scap: Update ContentTranslation [17:03:18] Logged the message, Master [17:06:55] <^d> YuviPanda: I grouped my dsh cleanup commits in a 'dsh-cleanup' topic btw. The one for scap-proxies is /trivial/ as hell [17:06:58] Nom Nom. I need to stick to deployment window. 
[17:07:15] (and avoid dinner before deployment :/) [17:09:48] 6operations, 10hardware-requests: Procure elasticseach servers for codfw - https://phabricator.wikimedia.org/T97049#1231663 (10RobH) 3NEW a:3RobH [17:13:40] 6operations, 10Wikimedia-DNS: adding new languages to DNS langs.tmpl doesn't work until zone template is edited as well - https://phabricator.wikimedia.org/T97051#1231680 (10Dzahn) 3NEW [17:14:46] (03PS2) 10Krinkle: contint: Set php html_errors=0 in php [puppet] - 10https://gerrit.wikimedia.org/r/206143 (https://phabricator.wikimedia.org/T97040) [17:14:53] 6operations, 10Wikimedia-DNS: adding new languages to DNS langs.tmpl doesn't work until zone template is edited as well - https://phabricator.wikimedia.org/T97051#1231692 (10Dzahn) [17:16:45] (03PS1) 10Dzahn: touch wikipedia.org template to generate 'gom' [dns] - 10https://gerrit.wikimedia.org/r/206146 (https://phabricator.wikimedia.org/T96468) [17:19:06] (03CR) 10Dzahn: [C: 032] touch wikipedia.org template to generate 'gom' [dns] - 10https://gerrit.wikimedia.org/r/206146 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [17:20:36] (03CR) 10Dzahn: "before: Host gom.wikipedia.org not found: 3(NXDOMAIN)" [dns] - 10https://gerrit.wikimedia.org/r/206146 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [17:22:08] 6operations, 10Wikimedia-DNS, 5Patch-For-Review: adding new languages to DNS langs.tmpl doesn't work until zone template is edited as well - https://phabricator.wikimedia.org/T97051#1231718 (10Dzahn) example: adding this single dot here: https://gerrit.wikimedia.org/r/#/c/206146/1/templates/wikipedia.org... [17:24:12] (03CR) 10Ori.livneh: [C: 04-1] "Small quibbles inline. Much better overall (and quite good in absolute terms)." (037 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis) [17:24:26] ^d: scap stuck at syncing proxy 99%! 
[17:24:36] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1231729 (10Dzahn) gom.wikipedia.org had not been added to DNS yet because of T97051. but since the above change i just merged, now it got actually adde... [17:24:40] <^d> ruh roh [17:24:48] <^d> stuck how? just sitting there? how long? [17:25:02] The host is snapshot1004.eqiad.wmnet [17:25:06] snapshot1004.eqiad.wmnet [17:25:08] yeah [17:25:12] 17.06. ie 14 min? [17:25:20] more than that. [17:25:35] you can kill it [17:25:41] and consider it successful [17:25:46] ^d: sync-common: 99% (ok: 464; fail: 0; left: 1) [17:25:52] i'll sync-common on snapshot1004 and if i fail i'll report to ops [17:25:56] ori: how to do that? [17:26:01] CTRL-C [17:26:08] no! [17:26:15] kill the ssh subprocess [17:26:30] if you ctrl-c you will have to start over [17:26:40] oh [17:26:42] kart_: ^ [17:26:44] bd808: ssh on my machine? [17:27:05] open another ssh session on tin, and then kill -9 the open ssh connection from there to snapshot1004.eqiad.wmnet [17:27:12] i'll do that [17:27:15] bd808: okay! [17:27:36] wait. it is going forward now :) [17:27:37] kart_: you should be unblocked [17:27:38] bd808: thanks [17:27:38] 6operations, 10Wikimedia-DNS: adding new languages to DNS langs.tmpl doesn't work until zone template is edited as well - https://phabricator.wikimedia.org/T97051#1231732 (10Dzahn) [17:27:57] but, 17:27:22 1 apaches had sync errors [17:28:01] !log scap stuck on snapshot1004; not accepting mwdeploy key [17:28:08] bd808: ori ^^ [17:28:09] Logged the message, Master [17:28:14] kart_: yeah, that's snapshot1004 [17:28:14] kart_: expected [17:28:25] okay! Thanks! [17:28:49] bd808: thanks for correcting me there [17:29:04] ori: yw.
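The unblock procedure bd808 describes above (kill only the stuck ssh child; Ctrl-C on scap itself forces a full restart) can be sketched as a hedged shell snippet. The host name comes from the log, but the `^ssh` command-line pattern is an assumption about how the child process appears on tin:

```shell
# From a second session on tin: find the ssh child that scap opened to the
# stuck host and kill only that process, so scap counts the host as failed
# and finishes, instead of starting the whole sync over.
stuck_host="snapshot1004.eqiad.wmnet"
# ^ssh anchors on the command name; the exact cmdline shape is an assumption
pid=$(pgrep -f "^ssh .*${stuck_host}" | head -n1 || true)
if [ -n "$pid" ]; then
  kill -9 "$pid"
else
  echo "no stuck ssh child found for ${stuck_host}"
fi
```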
it's a bit goofy [17:29:27] !log kartik Finished scap: Update ContentTranslation (duration: 26m 11s) [17:29:30] Logged the message, Master [17:29:48] 6operations, 10Wikimedia-DNS: adding new languages to DNS langs.tmpl doesn't work until zone template is edited as well - https://phabricator.wikimedia.org/T97051#1231735 (10Glaisher) perhaps just running `$ touch /path/file` works? If it does, we could add that to authdns, right? [17:35:43] ^d: how to fix code with merge conflict in tin? [17:36:12] ^d: it looks like something went wrong due to that. [17:36:14] <^d> Why is there a merge conflict? Security patch? Local hack? Something someone didn't push through gerrit? [17:36:45] ^d: it looks like we deployed code in wmf2 only last time and changed code in master. [17:36:54] which created the issue, I guess. [17:37:32] <^d> You'll have to resolve the conflict :) [17:38:04] ^d: in tin? [17:38:50] <^d> Eventually get it all committed in gerrit, but you can resolve the conflict on tin or locally, your call [17:39:15] (03PS4) 10Alex Monk: Change BZ references to Phabricator tickets [puppet] - 10https://gerrit.wikimedia.org/r/204626 (https://phabricator.wikimedia.org/T96431) [17:39:35] PROBLEM - dhclient process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:39:40] ^d: blah. Let me see where I messed up. [17:39:51] 7Puppet: Allow per-host hiera customizations on wikitech - https://phabricator.wikimedia.org/T97055#1231778 (10scfc) 3NEW [17:41:04] ^d: and if I fix in tin, then?
[17:42:55] RECOVERY - dhclient process on labvirt1006 is OK: PROCS OK: 0 processes with command name dhclient [17:43:56] PROBLEM - puppet last run on ganeti2005 is CRITICAL puppet fail [17:44:57] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1231808 (10Glaisher) (@Ijon We still need the logo and Module/Module_talk namespace translations before starting to work on this) [17:45:26] ^d: ^^ [17:45:59] <^d> kart_: Commit any changes you needed to fix tin to gerrit? [17:48:40] (03PS3) 10Krinkle: contint: Set php html_errors=0 in php [puppet] - 10https://gerrit.wikimedia.org/r/206143 (https://phabricator.wikimedia.org/T97040) [17:49:38] ^d: and then scap? [17:49:47] <^d> Yes [17:50:48] blah [17:51:07] !log kartik Synchronized php-1.26wmf2/extensions/ContentTranslation: (no message) (duration: 00m 15s) [17:51:11] Logged the message, Master [17:51:45] PROBLEM - nova-compute process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:51:45] PROBLEM - salt-minion processes on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:52:08] looks like I need a full scap? [17:52:15] 6operations, 10Graphoid, 6Services, 10service-template-node, and 2 others: Deploy graphoid service into production - https://phabricator.wikimedia.org/T90487#1231836 (10csteipp) [17:53:15] RECOVERY - nova-compute process on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [17:53:15] RECOVERY - salt-minion processes on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:59:55] PROBLEM - dhclient process on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:59:57] (03PS2) 10Jdlrobson: Enable VectorBeta form refresh on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/205474 [18:00:46] RECOVERY - puppet last run on ganeti2005 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [18:01:34] RECOVERY - dhclient process on labvirt1006 is OK: PROCS OK: 0 processes with command name dhclient [18:04:52] (03CR) 10Tim Landscheidt: "*argl* All instances have had their Puppet status switched to "stale" on wikitech." [puppet] - 10https://gerrit.wikimedia.org/r/203062 (owner: 10Hashar) [18:05:39] !log rebooting labvirt1006 [18:05:43] Logged the message, Master [18:09:54] PROBLEM - Host labvirt1006 is DOWN: PING CRITICAL - Packet loss = 100% [18:12:55] RECOVERY - Host labvirt1006 is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [18:23:24] (03PS1) 10Papaul: removed scs-c8-codfw from dns mgmt files. this opengear was used in the first place to configured the pfw's because they had bad serials port now the problem is fixed and the pfw's are connected to scs-c1-codfw [dns] - 10https://gerrit.wikimedia.org/r/206157 [18:28:23] (03PS2) 10Alex Monk: Removed scs-c8-codfw from DNS mgmt files [dns] - 10https://gerrit.wikimedia.org/r/206157 (owner: 10Papaul) [18:31:36] 6operations, 5Patch-For-Review: Upgrade xenon, cerium and praseodymium to jessie - https://phabricator.wikimedia.org/T90955#1231993 (10Cmjohnson) [18:31:38] 6operations, 10ops-eqiad: additional ssd for xenon, cerium and praseodymium ? - https://phabricator.wikimedia.org/T96841#1231991 (10Cmjohnson) 5Open>3Resolved Removed all the spinning disks, moved the ssds that were occupying slots 2 and 3 to slots 0 and 1 and added the 3rd ssd to slot 2.
Powered on [18:33:11] 6operations, 10hardware-requests: Procure elasticsearch servers for codfw - https://phabricator.wikimedia.org/T97049#1231994 (10Aklapper) [18:33:44] 6operations, 10ops-eqiad, 10Wikimedia-Logstash, 5Patch-For-Review: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1231996 (10Cmjohnson) a:5Cmjohnson>3bd808 Assigning this to @[[ https://phabricator.wikimedia.org | bd808 ]]. Removing the on-site project flag as the on-si... [18:33:57] 6operations, 10Wikimedia-Logstash, 5Patch-For-Review: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1231998 (10Cmjohnson) [18:34:49] (03CR) 10Tim Landscheidt: [C: 04-1] "No, the error is still there:" [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [18:37:26] (03PS4) 10Tim Landscheidt: move misc/labsdebrepo out of misc to module [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [18:43:39] (03PS4) 10Krinkle: contint: Set php html_errors=0 in php [puppet] - 10https://gerrit.wikimedia.org/r/206143 (https://phabricator.wikimedia.org/T97040) [18:54:56] (03Abandoned) 10Krinkle: contint: Set php html_errors=0 in php [puppet] - 10https://gerrit.wikimedia.org/r/206143 (https://phabricator.wikimedia.org/T97040) (owner: 10Krinkle) [18:57:46] 6operations: Create documentation on the requesting/allocation of virtual machines in the misc cluster - https://phabricator.wikimedia.org/T97072#1232100 (10RobH) 3NEW a:3RobH [19:08:25] Ok, so the new databases db2043-db2070 have been installed by papaul [19:08:39] papaul: since you dont yet have the shell access to palladium, I'll just advise you what I'm doing [19:08:46] We'll end up getting you this access soon [19:08:59] (and we'll do it in here so anyone who wants to follow along can do so) [19:09:12] ok\ [19:09:51] So, the server lifecycle document has the info on the next steps: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Post-Install:_Get_puppet_running [19:09:55] (03PS1)
10Dzahn: icinga: notification commands for analytics IRC [puppet] - 10https://gerrit.wikimedia.org/r/206163 (https://phabricator.wikimedia.org/T96928) [19:10:07] and this is the most often updated doc on wikitech =] [19:10:21] (i have no proof, but shit it at least is the most shared updated doc) [19:10:37] once a system is installed by our pxe server [19:10:48] it doesnt have puppet running, and it doesn't have our actual root keys set [19:11:10] it has a single root login key, which is on both iron and palladium as /root/.ssh/new_install [19:11:24] so, this means that every single new system has to have a root enable them into service [19:11:42] this stops systems from say.. failing a drive, rebooting into pxe, installing, and then turning back on [19:11:44] and into service... [19:11:51] it has happened. [19:12:26] so, i go on palladium, and login to the new server with the -i ssh flag [19:13:33] papaul: actually, i'm going to do this in a slightly different order than the list [19:13:37] what is the role of iron?
[19:13:44] iron is an operations only bastion [19:13:48] So all the non roots use bast1001 [19:13:54] to enter our network via ssh [19:13:58] ok [19:14:02] but ops, using root level stuff, we have a restricted ops only bastion [19:14:34] so, when we configure your shell access later, you'll have to setup ssh config on your system [19:14:46] and it'll reference iron, where normal users reference bast1001 [19:14:46] (03PS1) 10Dzahn: icinga: add irc bot user to analytics group [puppet] - 10https://gerrit.wikimedia.org/r/206164 (https://phabricator.wikimedia.org/T96928) [19:15:06] Now, normally we have to connect to each system and tell it to call puppet [19:15:13] so it can get told 'you dont have a signed key' [19:15:19] but, since these were installed over 30 minutes ago [19:15:23] 6operations, 10Wikimedia-Logstash, 5Patch-For-Review: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1232167 (10bd808) I have four Puppet patches for this: * - logstash: Remove redis input * they've all automatically tried their first call in on their own [19:15:37] and palladium (puppetmaster) told them they didnt have accepted/signed keys [19:15:48] so, i dont have to do the first step of logging in and firing off the first run [19:16:01] i can see this, because on palladium i run: sudo puppet cert -l [19:16:05] robh: since you have the on-call stick right now: what do I need to do next to keep https://phabricator.wikimedia.org/T96692 moving forward?
[19:16:06] to list all the unsigned puppet certs [19:16:25] (03CR) 10Dzahn: [C: 032] icinga: notification commands for analytics IRC [puppet] - 10https://gerrit.wikimedia.org/r/206163 (https://phabricator.wikimedia.org/T96928) (owner: 10Dzahn) [19:16:36] bd808: i totally plan to pick up and run with that shortly =] [19:16:45] awesome [19:16:53] but it may not be until late today, or possibly tomorrow morning, depending on my day [19:17:04] (03CR) 10Dzahn: [C: 032] icinga: add irc bot user to analytics group [puppet] - 10https://gerrit.wikimedia.org/r/206164 (https://phabricator.wikimedia.org/T96928) (owner: 10Dzahn) [19:17:06] i'll steal the ticket though so it doesnt get forgotten [19:17:12] that works for me. Just don't want it rotting in the backlog again :) [19:17:16] 6operations, 10Wikimedia-Logstash, 5Patch-For-Review: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1232175 (10RobH) a:5bd808>3RobH [19:17:42] i actually skimmed all the stuff in the email plus patchsets and i should have pinged you to let ya know ;D [19:17:50] 6operations, 10ops-eqiad, 10Incident-20141130-eqiad-C4: asw-c4-eqiad hardware fault? - https://phabricator.wikimedia.org/T93730#1232176 (10Cmjohnson) The current OS of the backup switch is Hostname: asw-a-eqiad Model: ex4200-48t JUNOS Base OS boot [10.4R3.4] JUNOS Base OS Software Suite [10.4R3.4] JUNOS Ker... [19:18:07] papaul: So, since they called in, i'll do the steps under https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Post-Install:_Get_puppet_running [19:18:19] and then go from the step: on the puppet master (definitely palladium, this time), run puppet cert -l to list all pending certificate signings. [19:18:37] then im going to accept the keys for every single new db [19:20:20] so once the servers finish their first install they do send a certificate request automatically to palladium?
[19:21:23] they do within 30 minutes of it finishing [19:21:33] so if you want it immediately, you have to login via ssh -i and fire puppet off [19:21:43] ok [19:21:45] but if you dont need it to happen asap, you can just go away for 30 and walk back to it waiting [19:21:50] (which is usually nicer ;) [19:21:58] ok [19:22:20] (03PS10) 10BryanDavis: debug logging: Convert to Monolog logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) [19:22:22] so im lazy though, so i wrote a for loop to accept all my certs ;D [19:22:34] for db in db{2043..2070}; do echo $db;sudo puppet cert -s $db.codfw.wmnet;done [19:22:52] and i'll use the same loop to ssh in and enable puppet on all the systems [19:23:16] (03PS1) 10Dzahn: icinga-wm: configure to join -analytics channel [puppet] - 10https://gerrit.wikimedia.org/r/206167 (https://phabricator.wikimedia.org/T96928) [19:23:45] papaul: so once we accept the key [19:23:52] puppet still wont run on the servers [19:23:56] as the default puppet state is disabled [19:24:04] again, this is to ensure we don't push a server into service prematurely [19:24:17] plus saves headaches on other things [19:24:26] so, we need to enable it across the new hosts [19:24:37] enter for loop: for db in db{2043..2070}; do echo $db;sudo ssh -i /root/.ssh/new_install root@$db.codfw.wmnet "puppet agent --enable";done [19:24:49] (03CR) 10Merlijn van Deen: "Looks good to me. Inline there's a few nitpicks, and a more general idea: could/should we do host config via hiera instead of wikitech? 
We" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [19:25:33] (03CR) 10Dzahn: [C: 032] icinga-wm: configure to join -analytics channel [puppet] - 10https://gerrit.wikimedia.org/r/206167 (https://phabricator.wikimedia.org/T96928) (owner: 10Dzahn) [19:25:51] papaul: if you arent sure how to write for loops, copy those down for later re-use (you'll appreciate it later ;) [19:26:16] you can use them to quickly gather all your mac address info for lease file updates [19:27:07] 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Add icinga-wm bot to #wikimedia-analytics - https://phabricator.wikimedia.org/T96928#1232215 (10Dzahn) 1) create special icinga contact in private repo: ``` 51 define contact{ 52 contact_name irc-analytics 53 al... [19:27:15] So, now we have accepted all of the keys, and enabled puppet on the new hosts [19:27:31] we can proceed one of two ways, wait and let them all call in on their own or force them in [19:27:40] once they do their initial puppet runs, we'll need to accept the salt keys [19:27:50] so puppet we use to manage files, packages, manifests, etc.. 
[19:27:59] salt we use to execute immediate commands across cluster [19:28:12] and puppet puts the salt config files in place [19:28:22] so we have to accept those salt keys AFTER they complete their first puppet run [19:28:28] since they dont exist before it [19:28:41] if its a hurry, you force the puppet call in to get the key generated [19:28:58] if its not a hurry (like this), I'll usually just go do something else for 30 minutes and come back to this after they call in on their own [19:29:05] and then sign their keys [19:29:19] (instructions for accepting keys are also on https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Post-Install:_Get_puppet_running) [19:29:56] papaul: you don't have to stick around for that part, since its just accepting the keys, and with this many, its going to take a long time for them to all force call in [19:30:03] its really just more efficient for this to wait for them to do it on their own [19:30:12] but once the salt key is accepted [19:30:24] i'd update the task that its ready for service implementation, and hand it off to the appropriate person [19:30:47] examples: brandon took over varnish service implementation, joe did the apaches and a bunch of stuff [19:30:51] sean does databases [19:31:01] so when salt is accepted on all these, i'll assign the task to sean [19:31:11] !log restarting icinga-wm for config change [19:31:19] Logged the message, Master [19:31:45] papaul: So, next week we'll look at getting you started with some shell access. I don't think you'll be getting root right away, but dunno yet [19:31:59] ok [19:32:02] at minimum i want to get you the onsite access to shell for basic troubleshooting [19:32:09] and completing installs like this [19:35:02] CUSTOM - Check status of defined EventLogging jobs on eventlog1001 is OK All defined EventLogging jobs are running.
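RobH's two loops earlier (`for db in db{2043..2070}; do ... done`) lean on bash brace expansion to enumerate the 28 new databases. A small sketch of the mechanics; the `puppet cert` call is commented out since it only works on the puppetmaster:

```shell
# db{2043..2070} expands to db2043 db2044 ... db2070 before the loop runs
hosts=(db{2043..2070})
echo "count=${#hosts[@]}"                  # prints count=28
echo "first=${hosts[0]} last=${hosts[27]}" # prints first=db2043 last=db2070
for db in "${hosts[@]}"; do
  : # sudo puppet cert -s "$db.codfw.wmnet"   # sign each pending cert (needs palladium)
done
```

The same expansion drives the second loop that runs `puppet agent --enable` over ssh with the new_install key.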
(03CR) 10BryanDavis: debug logging: Convert to Monolog logging (037 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis) [19:35:23] ^ test to see that those now show up in analytics channel, and they do [19:36:06] Robh: thanks for the training [19:36:11] welcome =] [19:37:21] 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Add icinga-wm bot to #wikimedia-analytics - https://phabricator.wikimedia.org/T96928#1232238 (10Dzahn) 5Open>3Resolved 5. restart icinga-wm root@neon: /etc/init.d/ircecho restart see it join the channel: 12:31 -!- icinga-wm [~icinga-wm@neon.wik... [19:37:36] jgage: ^ [19:38:01] robh: i did not understand the netboot.cfg part but will look into it again and get back with you if i have any questions [19:38:21] yea, that file does a bunch of stuff [19:38:31] it sets the network parameters based off what subnet the request comes in on [19:38:36] and also sets the disk partitioning [19:39:19] based off hostname [19:39:20] robh: ok thanks [19:39:35] once we have your shell on the install server [19:39:41] you can watch it serve out info based off the file [19:39:44] and it makes a bit more sense [19:39:49] ok [19:49:15] (03CR) 10RobH: [C: 032] "CANNOT MERGE UNTIL 2015-04-27 for the 3 business day wait period" [puppet] - 10https://gerrit.wikimedia.org/r/205917 (https://phabricator.wikimedia.org/T96886) (owner: 10RobH) [19:49:22] (03CR) 10RobH: [C: 04-2] mholloway granted access as releaser-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/205917 (https://phabricator.wikimedia.org/T96886) (owner: 10RobH) [19:49:32] ahhh, habit to +2 when i meant to -2 [19:50:04] incites a moment of panic until you realize you didnt submit the patchset, just review.
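RobH's description of netboot.cfg (the subnet picks the network parameters, the hostname picks the disk partitioning) can be illustrated with a hypothetical hostname-matching sketch. The real file lives in the puppet repo; the recipe paths below are invented for illustration:

```shell
# Hypothetical: map a hostname pattern to a partman recipe, in the same
# spirit as netboot.cfg. These recipe names are made up, not the real ones.
pick_partman() {
  case "$1" in
    db2*) echo "partman/db.cfg" ;;        # new codfw databases
    cp*)  echo "partman/varnish.cfg" ;;   # cache hosts
    *)    echo "partman/standard.cfg" ;;  # everything else
  esac
}
pick_partman db2043    # prints partman/db.cfg
```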
[19:50:20] haha, done that more times than I'd like to admit [19:51:55] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1232251 (10Eevans) >>! In T93790#1148439, @faidon wrote: > If we're buying more, can we also figure out our plans (& procurement) for codfw? How would an additional DC fit in our plans? W... [19:54:47] 6operations, 5Patch-For-Review: contacts.wikimedia.org drupal unpuppetized / retire contacts - https://phabricator.wikimedia.org/T90679#1232252 (10Dzahn) I have created dumps of the 2 (!) databases using mysqldump. One is "contacts_drupal" and the other "contacts_civicrm" (because CiviCRM ran on top of Drupal)... [20:00:10] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 23.08% of data above the critical threshold [500.0] [20:00:31] PROBLEM - puppet last run on cp3022 is CRITICAL puppet fail [20:01:17] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [20:08:07] Hi [20:08:15] Are you doing any updates right now? [20:09:09] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1232270 (10GWicke) I'm also in favor of staying with three replicas for now. > What is the purpose of the cluster in codfw? Is it a read-only hot spare? Disaster recovery? Geo-aware acti... [20:09:29] I'm getting some unusual misloading of style-sheets [20:09:44] Minor annoyance.... but thought I'd check anyway [20:09:47] (From UK) [20:10:01] Me too, 1 % packet loss [20:11:36] (03PS1) 10Dzahn: backup: add new fileset for contacts.wm [puppet] - 10https://gerrit.wikimedia.org/r/206174 (https://phabricator.wikimedia.org/T90679) [20:13:20] Qcoder00: http://www.init7.net/de/status/&ticket=8705 [20:13:41] Ah... [20:13:46] Are you going through them? [20:13:56] Technical problems outside your control...
[20:14:02] That's probably it [20:14:42] (03PS1) 10Mattflaschen: Remove Flow cache version override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206175 (https://phabricator.wikimedia.org/T96951) [20:15:02] Doesn't help that they pretend nothing is affected [20:15:12] I assume it's being raised with them... [20:15:53] If Telehouse East is a routing point between UK and NL it wouldn't surprise me if an outage gave some occasional mis-loads [20:16:45] hmm yes there is a little hiccup visible here: http://smokeping.wikimedia.org/?target=ESAMS.Core [20:17:20] The funny part is that, as always, the packet loss happens after r1ams2.core.init7.net, at gw-wikimedia.init7.net [20:18:02] So the loss is on whose side? [20:18:17] RECOVERY - puppet last run on cp3022 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [20:18:46] Anyway... It's not looking like a Wikimedia side issue... so It's outside your control? [20:19:00] agreed :\ [20:19:33] thank you for reporting, i'll keep an eye on it [20:20:27] (03PS1) 10Dzahn: contacts: remove Apache, add Backup [puppet] - 10https://gerrit.wikimedia.org/r/206176 (https://phabricator.wikimedia.org/T90679) [20:20:46] Thanks [20:23:16] (03CR) 10Tim Landscheidt: [C: 04-1] "The template templates/misc/labsdebrepo.erb needs to go under modules/labs_debrepo/templates/ with amended path, and "dir" in there needs " [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [20:24:57] Looks like it's been a busy day for them http://www.init7.net/en/status/?ticket=8704 [20:30:46] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [20:32:57] (03PS1) 10Yuvipanda: tools: Temporarily and sadly puppetize /etc/hosts [puppet] - 10https://gerrit.wikimedia.org/r/206182 [20:33:06] valhallasw`cloud: ^ [20:33:11] (03PS2) 10Mattflaschen: Bump Flow cache to 4.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206175 (https://phabricator.wikimedia.org/T96951) [20:34:03]
YuviPanda: care to fix those line endings? :P [20:35:28] (03PS2) 10Yuvipanda: tools: Temporarily and sadly puppetize /etc/hosts [puppet] - 10https://gerrit.wikimedia.org/r/206182 [20:35:33] valhallasw`cloud: ^ [20:36:34] (03CR) 10Merlijn van Deen: [C: 031] tools: Temporarily and sadly puppetize /etc/hosts [puppet] - 10https://gerrit.wikimedia.org/r/206182 (owner: 10Yuvipanda) [20:37:00] Coren: ^ [20:37:08] Just temporarily puppetizing it [20:41:52] YuviPanda: but as far as you know, there's no tool that pulls the repo and tells you what puppet would change if run? [20:42:08] valhallasw`cloud: there’s the puppet compiler but that’s just for production [20:42:39] ? [20:43:33] essentially, in my perfect world, pushing a puppet change for tool labs would do a fake run of that change on all toolsbeta servers and tell you the diff [20:44:00] and then a second tool to do a manual apply-and-reset [20:44:14] valhallasw`cloud: yeah, so we can do that on toolsbeta if we have a full setup there [20:44:23] valhallasw`cloud: so there’s toolsbeta-puppetmaster3 I think [20:44:29] valhallasw`cloud: so that’s set as puppetmaster for all of toolsbeta [20:44:42] so you can go there and apply your patch to /var/lib/git/operations/puppet [20:44:47] and run puppet on the appropriate hosts to check [20:44:53] but toolsbeta doesn’t have everything tools has [20:44:57] so we can’t test all of it...
[20:45:10] sure, but it's better than the 'well let's hope this works' we do now :P [20:45:12] so the path from here to perfect world is to setup all of this on toolsbeta too [20:45:14] yeah :) [20:47:50] (03CR) 10Dzahn: [C: 032] backup: add new fileset for contacts.wm [puppet] - 10https://gerrit.wikimedia.org/r/206174 (https://phabricator.wikimedia.org/T90679) (owner: 10Dzahn) [20:48:16] (03PS3) 10Yuvipanda: tools: Temporarily and sadly puppetize /etc/hosts [puppet] - 10https://gerrit.wikimedia.org/r/206182 [20:48:25] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Temporarily and sadly puppetize /etc/hosts [puppet] - 10https://gerrit.wikimedia.org/r/206182 (owner: 10Yuvipanda) [20:49:03] sadpanda sadly puppetizes sad hosts ?:) [20:49:15] (03CR) 10Milimetric: Adding a Last-Access cookie to text and mobile requests (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T88813) (owner: 10Nuria) [20:49:24] 7Puppet, 6Labs: Fix Puppet timestamp updater for wikitech - https://phabricator.wikimedia.org/T97082#1232425 (10scfc) 3NEW a:3scfc [20:49:40] oh god, when I saw that at first I thought it was for prod, I didn't see the tools: [20:49:43] I was like, WTF [20:50:02] bblack: :) this is better than manually copying a version of the file across hosts…. [20:50:02] :) [20:50:04] but still horrible [20:50:12] but I’m tired of stray unupdated files fucking shit up [20:50:35] bblack: mutante soon we’ll have better DNS and all of this can go die [20:51:38] sounds good! 
[20:52:13] (03CR) 10Dzahn: [C: 032] contacts: remove Apache, add Backup [puppet] - 10https://gerrit.wikimedia.org/r/206176 (https://phabricator.wikimedia.org/T90679) (owner: 10Dzahn) [20:52:59] (03CR) 10Gage: "@Manybubbles, can you please comment on the question posed in hieradata/role/common/logstash/elasticsearch.yaml, "should multicast_group b" [puppet] - 10https://gerrit.wikimedia.org/r/205970 (https://phabricator.wikimedia.org/T96814) (owner: 10BryanDavis) [20:58:09] (03CR) 10Manybubbles: "@Gage - I didn't see the question - sorry!" [puppet] - 10https://gerrit.wikimedia.org/r/205970 (https://phabricator.wikimedia.org/T96814) (owner: 10BryanDavis) [20:59:37] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [21:03:01] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to caesium for Michael Holloway - https://phabricator.wikimedia.org/T96886#1232509 (10dr0ptp4kt) Approved. [21:04:52] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to caesium for Michael Holloway - https://phabricator.wikimedia.org/T96886#1232515 (10RobH) a:3RobH As long as nothing comes up during the 3 day wait, I'll merge next Monday, 2015-04-27 [21:13:05] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to caesium for Michael Holloway - https://phabricator.wikimedia.org/T96886#1232546 (10bearND) @RobH He'll need releasers-mobile, but not releasers-mediawiki. Same as dbrant, and bsitzmann basically. Thanks! [21:14:35] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to caesium for Michael Holloway - https://phabricator.wikimedia.org/T96886#1232547 (10RobH) Ahh, my patchset is wrong then, I'll fix it shortly. Thanks! 
[21:16:36] 6operations, 5Patch-For-Review: contacts.wikimedia.org drupal unpuppetized / retire contacts - https://phabricator.wikimedia.org/T90679#1232576 (10Dzahn) confirmed zirconium is now a backup host and shows up on the bacula director and there is a new fileset that backs up the database dump files i created [21:17:27] (03CR) 10Gage: "Great, thanks for demystifying." [puppet] - 10https://gerrit.wikimedia.org/r/205970 (https://phabricator.wikimedia.org/T96814) (owner: 10BryanDavis) [21:19:25] (03PS4) 10BryanDavis: logstash: Convert $::realm switches to hiera [puppet] - 10https://gerrit.wikimedia.org/r/205969 [21:19:27] (03PS3) 10BryanDavis: logstash: Convert Elasticsearch on logstash100[1-3] to client [puppet] - 10https://gerrit.wikimedia.org/r/205971 (https://phabricator.wikimedia.org/T96814) [21:19:29] (03PS4) 10BryanDavis: logstash: Provision Elasticsearch only backend hosts [puppet] - 10https://gerrit.wikimedia.org/r/205970 (https://phabricator.wikimedia.org/T96814) [21:19:38] (03PS5) 10Gage: logstash: Provision Elasticsearch only backend hosts [puppet] - 10https://gerrit.wikimedia.org/r/205970 (https://phabricator.wikimedia.org/T96814) (owner: 10BryanDavis) [21:20:14] heh we raced, bd808 [21:20:22] oops [21:20:33] I was just getting rid of the FIXME comment [21:20:46] yeah i just nuked that line and changed the IP to .6 [21:21:03] ah. ok [21:21:35] Probably will require a cluster restart to take effect [21:21:49] <^d> For what? [21:22:19] will logstash1001-3 continue running elasticsearch after 4-6 are added? 
[21:23:28] i don't feel strongly about changing the ip, it just seemed like there's little reason to overlap and the clarity might be helpful [21:23:46] jgage: yes, all 6 will run ES and join to the same cluster [21:24:02] ok [21:24:04] The last step will be to make the 1-3 nodes client only (no data, no master) [21:24:26] but that will let logstash and kibana talk to ::1 to get into the cluster [21:24:48] and the ES magic will take care of routing messages to the right data nodes [21:24:55] great [21:26:43] (03CR) 10BBlack: Adding a Last-Access cookie to text and mobile requests (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T88813) (owner: 10Nuria) [21:27:01] (03PS12) 10BBlack: Adding a Last-Access cookie to text and mobile requests [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T88813) (owner: 10Nuria) [21:29:16] (03CR) 10Gage: [C: 032] logstash: Provision Elasticsearch only backend hosts [puppet] - 10https://gerrit.wikimedia.org/r/205970 (https://phabricator.wikimedia.org/T96814) (owner: 10BryanDavis) [21:32:03] Wheee :D [21:32:31] (03CR) 10Dzahn: "would have merged if it wasn't also touching Apache config files. that requires deploying" [puppet] - 10https://gerrit.wikimedia.org/r/204626 (https://phabricator.wikimedia.org/T96431) (owner: 10Alex Monk) [21:33:13] (03PS2) 10Ori.livneh: Convert MWLogger to MediaWiki\Logger\LoggerFactory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201987 (owner: 10BryanDavis) [21:33:43] (03PS3) 10RobH: mholloway granted access as releaser-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/205917 (https://phabricator.wikimedia.org/T96886) [21:33:45] omg. it's merge bd808's logging stuff day! [21:34:14] (03CR) 10Dzahn: "the comments are a bit confusing. it's like you guys agree on _not_ merging this but also there is a +1 ?"
[puppet] - 10https://gerrit.wikimedia.org/r/198116 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [21:35:13] (03CR) 10Mholloway: [C: 031] mholloway granted access as releaser-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/205917 (https://phabricator.wikimedia.org/T96886) (owner: 10RobH) [21:35:15] (03CR) 10BryanDavis: "1.26wmf1 is everywhere now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201987 (owner: 10BryanDavis) [21:35:39] (03CR) 10Ori.livneh: [C: 032] debug logging: Convert to Monolog logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis) [21:35:46] (03Merged) 10jenkins-bot: debug logging: Convert to Monolog logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis) [21:35:52] (03CR) 10Ori.livneh: [C: 032] Convert MWLogger to MediaWiki\Logger\LoggerFactory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201987 (owner: 10BryanDavis) [21:36:19] (03Merged) 10jenkins-bot: Convert MWLogger to MediaWiki\Logger\LoggerFactory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201987 (owner: 10BryanDavis) [21:36:35] 6operations, 10Wikimedia-Logstash, 5Patch-For-Review: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1232677 (10RobH) argh, this task is a mess... in the future lets use the template laid out on https://wikitech.wikimedia.org/wiki/Phabricator#Hardware.2FServer_Setup_.2F_Deplo... 
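[Editor's note: the logstash Elasticsearch topology discussed above — logstash1004-1006 joining as data backends, with logstash1001-1003 later becoming client-only nodes (no data, no master) that logstash and kibana reach via ::1 — maps onto standard Elasticsearch 1.x node-role settings. The sketch below is illustrative only; the actual hiera keys and file layout in the puppet patches may differ.]

```yaml
# Hypothetical elasticsearch.yml fragment for logstash1001-1003 once they
# are converted to client-only nodes: they join the cluster and route
# requests, but hold no shards and are never elected master.
node.master: false
node.data: false

# logstash1004-1006 would keep the defaults (master-eligible data nodes):
# node.master: true
# node.data: true

# Multicast discovery group from the review discussion, changed from
# 224.2.2.5 to 224.2.2.6 to match production (illustrative value).
discovery.zen.ping.multicast.group: 224.2.2.6
```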
[21:36:43] (03CR) 10Dzahn: [C: 031] base: vim -> vim-nox [puppet] - 10https://gerrit.wikimedia.org/r/203342 (owner: 10Hashar) [21:36:50] (03PS2) 10Dzahn: base: vim -> vim-nox [puppet] - 10https://gerrit.wikimedia.org/r/203342 (owner: 10Hashar) [21:36:57] (03CR) 10Dzahn: [C: 031] base: vim -> vim-nox [puppet] - 10https://gerrit.wikimedia.org/r/203342 (owner: 10Hashar) [21:39:33] bd808, is there a task or something explaining why we're removing redis input? does using udp syslog help us decouple mw more than a message queue would? [21:40:17] in theory it redis input sounded nice for logs from hadoop [21:40:42] s/it / [21:41:39] 6operations, 10Wikimedia-Logstash, 5Patch-For-Review: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1232681 (10RobH) [21:42:08] (03PS2) 10Gage: logstash: Remove redis input [puppet] - 10https://gerrit.wikimedia.org/r/205968 (owner: 10BryanDavis) [21:42:18] (03CR) 10Dzahn: [C: 032] tiny formatting fix in apache_status.pyconf [puppet] - 10https://gerrit.wikimedia.org/r/206006 (owner: 10Dzahn) [21:42:28] !log ori Synchronized wmf-config: Ie22658727 and Ice65e7e70: use Monolog to configure logging (duration: 00m 15s) [21:42:33] Logged the message, Master [21:42:34] ^ bd808 [21:43:14] undef var in fatal monitor [21:43:28] wmgDefaultMonologHandler [21:43:40] and wmfMOnologChannels [21:43:52] don't look to be growing though... [21:43:52] momentary race condition due to sync? [21:43:56] yeah [21:44:08] (03PS3) 10Gage: logstash: Remove redis input [puppet] - 10https://gerrit.wikimedia.org/r/205968 (owner: 10BryanDavis) [21:44:12] hmm unclear why i had to rebase that twice [21:44:17] maybe just the burp from parsing InitializeSettings [21:44:31] jgage: for fun! [21:44:34] yay!
[21:44:52] (03PS2) 10Dzahn: install-server: partman for dm-cache [puppet] - 10https://gerrit.wikimedia.org/r/200134 (https://phabricator.wikimedia.org/T88994) (owner: 10Filippo Giunchedi) [21:45:09] jgage: I wonder if rebases get you more tip4commit [21:45:10] bd808: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=fluorine.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+eqiad [21:45:15] (03CR) 10Gage: [C: 032] logstash: Remove redis input [puppet] - 10https://gerrit.wikimedia.org/r/205968 (owner: 10BryanDavis) [21:45:29] woah [21:45:48] PROBLEM - HHVM rendering on mw1193 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50518 bytes in 0.023 second response time [21:45:48] PROBLEM - HHVM rendering on mw1181 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50518 bytes in 0.040 second response time [21:45:48] PROBLEM - HHVM rendering on mw1121 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50518 bytes in 0.025 second response time [21:45:56] PROBLEM - Apache HTTP on mw1181 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50518 bytes in 0.020 second response time [21:45:56] PROBLEM - Apache HTTP on mw1058 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50518 bytes in 0.016 second response time [21:46:16] PROBLEM - Apache HTTP on mw1193 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50518 bytes in 0.019 second response time [21:46:37] PROBLEM - Apache HTTP on mw1121 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50518 bytes in 0.019 second response time [21:47:27] RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 63553 bytes in 0.598 second response time [21:47:36] RECOVERY - HHVM rendering on mw1181 is OK: HTTP OK: HTTP/1.1 200 OK - 63553 bytes in 0.124 second response time [21:47:36] RECOVERY - Apache HTTP on mw1181 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.041 second 
response time [21:47:37] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.046 second response time [21:47:51] ori: Not sure why it would jump that much [21:47:57] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.050 second response time [21:48:17] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.104 second response time [21:48:30] it's climbing back down [21:48:40] k [21:48:53] hhvm.log looks ok [21:48:55] (03CR) 10Dzahn: "@ArielGlenn: seems it might not be needed. at least now there was nothing older than 90 days:" [puppet] - 10https://gerrit.wikimedia.org/r/197081 (owner: 10ArielGlenn) [21:49:09] It was the damn change to InitializeSettings [21:49:17] RECOVERY - HHVM rendering on mw1121 is OK: HTTP OK: HTTP/1.1 200 OK - 63553 bytes in 0.982 second response time [21:49:20] I hate how fragile that is [21:49:58] apart from the heart attack and the alerts it doesn't look like we served a lot of errors: https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22)&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22) [21:50:35] oh.. and I know what jumped the volume up. I turned the xff log back on [21:50:38] it's not even the biggest failure in the last hour [21:50:58] (the big jump in 20:00 is) [21:51:10] but still not a happy experience [21:51:27] (03CR) 10Dzahn: [C: 04-1] "could we add those privileges to a new proper admin group we add to the admin/module rather than using wikidev and setting permissions her" [puppet] - 10https://gerrit.wikimedia.org/r/189604 (owner: 10Ori.livneh) [21:52:11] The "right" way to add new vars to InitializeSettings is in 2 separate syncs. 
One to introduce and then a second to use [21:52:18] which is crappy [21:52:50] Does rsync do it in alpha order? Or random? [21:53:49] Hmmm.. inode order on the dest server I think but not sure honestly [21:54:14] (03CR) 10Dzahn: [C: 031] dumps: improve nginx disk utilisation via directio [puppet] - 10https://gerrit.wikimedia.org/r/190940 (owner: 10Ori.livneh) [21:54:18] (03PS2) 10Dzahn: dumps: improve nginx disk utilisation via directio [puppet] - 10https://gerrit.wikimedia.org/r/190940 (owner: 10Ori.livneh) [21:54:20] Renaming Initialise to something < Common would be pretty hacky :) [21:54:22] !log Syncing Ie22658727 and Ice65e7e70 (which introduce new InitialiseSettings vars) in one go caused a small burst of 500s (peaking at 500/sec and lasting a few seconds) on four app servers. [21:54:26] Logged the message, Master [21:54:31] yeah inode order is my guess, was just googling for confirmation [21:55:06] anything other than inode order would require a sort and that would be goofy for rsync to do [21:56:06] you can use --files-from and give it a text file to change the order [21:56:31] mostly we need to build a sane config system ;) [21:56:44] a 15,000 line file of php is not sane [21:56:47] !log Additional (planned) outcome of Ie22658727 and Ice65e7e70: xff log flowing to fluorine, causing bytes-in to climb from ~1.2M/s to ~2.1M/s [21:56:50] !hss [21:56:50] Logged the message, Master [21:56:57] oh that’s gone :) nvm [21:57:15] bd808: spoil sport [21:57:53] I already got my shirt; nobody else needs one [21:58:24] bd808: btw, my mum was going to frame my "I broke Wikipedia.. but then I fixed it" t-shirt [21:58:31] Dunno if she got round to doing it [21:58:51] bd808: Things look good. I'm going to go grab a bite in a couple of minutes if that's cool. [21:59:03] Son deeds [21:59:13] Reedy: You should wear it while you're doing touch and go practice [21:59:28] I have to wear a white shirt, epaulettes and shit :/ [21:59:33] ori: works for me.
I'm watching things [21:59:56] coolio [22:00:13] * bd808 can't wait to bump into captain Reedy in an airport :) [22:00:17] https://scontent-atl.xx.fbcdn.net/hphotos-xaf1/v/t1.0-9/11043186_10155318485830385_7035036673281394323_n.jpg?oh=9e39556118c6a8ddeb8299583668546d&oe=55D45F00 [22:01:57] ook. I need to make a followup to change the memcached log level [22:03:21] (03PS5) 10Gage: logstash: Convert $::realm switches to hiera [puppet] - 10https://gerrit.wikimedia.org/r/205969 (owner: 10BryanDavis) [22:03:33] bd808: ^ changed to 224.2.2.6 to match prod [22:04:22] jgage: it's .5 in hieradata/role/common/logstash.yaml in that patch [22:04:31] oops, thanks [22:04:44] so you are making them all .6 right? [22:04:53] yeah [22:04:59] k [22:05:01] (03PS6) 10Gage: logstash: Convert $::realm switches to hiera [puppet] - 10https://gerrit.wikimedia.org/r/205969 (owner: 10BryanDavis) [22:05:15] We have clearance, Clarence. Roger, Roger. What's our vector, Victor? We're ready, Reedy. [22:06:40] gah i missed the one in logstash.pp. too eager. [22:06:48] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1232717 (10Mjbmr) Yes, fallback language should be Hindi, and for the logo just use this [[https://commons.wikimedia.org/wiki/File:Wikipedia-logo-v2-hi.... 
[22:07:14] (03PS1) 10BryanDavis: logstash: Fix log level detection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206299 [22:07:46] oh no i misread the diff [22:09:30] (03CR) 10BryanDavis: [C: 032] logstash: Fix log level detection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206299 (owner: 10BryanDavis) [22:09:35] (03Merged) 10jenkins-bot: logstash: Fix log level detection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206299 (owner: 10BryanDavis) [22:10:10] (03CR) 10Gage: [C: 032] logstash: Convert $::realm switches to hiera [puppet] - 10https://gerrit.wikimedia.org/r/205969 (owner: 10BryanDavis) [22:10:43] (03PS1) 10Dzahn: set logo for gom.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) [22:10:51] !log bd808 Synchronized wmf-config/logging.php: logstash: Fix log level detection (c09014d) (duration: 00m 17s) [22:10:57] Logged the message, Master [22:12:43] Reedy: Have you scapped since I added the flying pig logo? It's the greatest innovation in the python port :) [22:12:45] https://www.mediawiki.org/wiki/User:BDavis_%28WMF%29#/media/File:Scap-logo-white-on-black.png [22:12:49] Haha [22:13:02] I've not... But I'm sure I saw someone post it [22:13:16] :) [22:13:21] well maybe second greatest. The progress bars you asked for are pretty nice too [22:13:29] (03PS2) 10Dzahn: set logo for gom.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) [22:17:36] (03CR) 10Dzahn: [C: 032] monitoring: selector outside a resource [puppet] - 10https://gerrit.wikimedia.org/r/195520 (owner: 10Matanya) [22:19:14] how does this even work? [22:19:18] target => '/etc/nagios/puppet_hosts.cfg', [22:19:30] /etc/nagios/puppet_hosts.cfg: ERROR: cannot open `/etc/nagios/puppet_hosts.cfg' (No such file or directory) [22:20:22] mutante: are you wondering why it fails, or just annoyed that there’s broken puppet code? 
[20:58:09] andrewbogott: i am wondering how it's possible that it does NOT fail [22:21:11] :) [22:21:13] there is puppet code that specifies this file as a target [22:21:19] Only if /etc/nagios exists… [22:21:20] and it doesn't even exist on neon [22:21:26] It’s making a link, right? [22:21:28] or, trying? [22:22:11] # Exports the resource that monitors hosts in icinga/shinken [22:22:26] i don't see that "if" yet [22:22:42] but maybe in one of the other modules :p [22:23:02] No, I just meant... [22:23:14] well, now that I’m looking at the code, I have no idea. Maybe target is just ignored by that class. [22:30:00] andrewbogott: probably replaced by icinga::naggen [22:30:06] 8 file { '/etc/icinga/puppet_hosts.cfg': [22:30:16] 9 content => generate('/usr/local/bin/naggen2', '--type', 'hosts'), [22:30:19] yeah [22:30:32] so that code is not used and just cruft i think [22:31:34] easy to get lost between modules icinga, nagios_common, monitoring and private repo [22:36:10] (03Abandoned) 10Nuria: Change metric prefix after logster changes [puppet] - 10https://gerrit.wikimedia.org/r/200593 (owner: 10Nuria) [22:36:36] Your 131072x1 screen size is bogus. Expect trouble. [22:36:36] Notice: Finished catalog run in 50.05 seconds [22:41:12] (03CR) 10Mjbmr: "I thought you guys are going to upload it at local [[File:Wiki.png]], perhaps it has to be uploaded at [[File:Wikipedia-logo-v2-gom.svg]]." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [22:42:23] (03CR) 10Dzahn: "why local?
all those other logos are on commons as well (and using .png instead of .svg)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [22:48:29] !log upgrading labvirt1001 to linux-image-3.16.0-34-generic, dist-upgrading, and rebooting [22:48:34] Logged the message, Master [22:53:15] PROBLEM - Host labvirt1001 is DOWN: PING CRITICAL - Packet loss = 100% [23:00:00] jgage: https://twitter.com/init7/status/591375480919609345 [23:00:04] RoanKattouw, ^d, rmoen, mattflaschen: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150423T2300). Please do the needful. [23:00:04] RoanKattouw, ^d, rmoen, mattflaschen: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150423T2300). [23:00:29] double jouncebot action! [23:00:59] woot... Whos first? [23:01:05] jouncebot_: die [23:01:06] !next [23:01:22] jouncebot: refresh [23:01:23] I refreshed my knowledge about deployments. [23:01:31] jouncebot: next [23:01:31] I'd like to get one last one in: https://gerrit.wikimedia.org/r/#/c/206314/ [23:01:42] I can set up the bump as soon as it merges. [23:02:06] grr [23:02:39] ok. Alright if i go for MobileFrontend ? [23:03:02] jouncebot: next [23:03:44] jouncebot: refresh [23:03:44] I refreshed my knowledge about deployments. [23:03:47] jouncebot: next [23:10:25] rmoen, I added the patch I mentioned, if that's okay. Bump is https://gerrit.wikimedia.org/r/#q,206322,n,z [23:11:05] I think so.. mattflaschen can you hang around for a bit? Doing MF update [23:11:06] first [23:11:18] Sure, that's fine. [23:13:16] RECOVERY - Host labvirt1001 is UPING OK - Packet loss = 0%, RTA = 1.89 ms [23:20:27] rmoen: Can I tack on a late addition? 
[23:20:41] uhh i might run out of time [23:20:45] RoanKattouw: :) [23:20:54] OK [23:21:00] I'll wait for you to be done with normal stuff [23:21:05] And then I'll ask greg-g for permission [23:21:05] thanks [23:21:23] ori: I wonder if your deduplication script is making UP: PING OK to UPING OK [23:21:26] because it matches P: P [23:22:03] tell me about this deduplication script [23:22:12] i was just trying to find the source of mysterious UPING [23:22:23] but neon:/usr/lib/nagios/plugins/check_ping is a binary [23:22:56] jgage: it’s built into ircecho [23:23:01] oho [23:23:02] jgage: modules/ircecho/files/ircecho [23:23:06] thanks [23:23:48] wow i didn't know you can have a space after a hashbang [23:24:19] YuviPanda: it is [23:30:06] !log rmoen Synchronized php-1.26wmf3/extensions/MobileFrontend/: Update MobileFrontend to cherry picks (duration: 00m 38s) [23:30:11] Logged the message, Master [23:30:13] jouncebot_: next [23:30:16] hmm [23:30:20] jouncebot: next [23:30:58] !log rmoen Synchronized php-1.26wmf2/extensions/MobileFrontend/: Update MobileFrontend to cherry picks (duration: 00m 20s) [23:31:01] Logged the message, Master [23:33:39] rmoen said it's okay if I just do the cache bump myself to save a little time. [23:33:55] mattflaschen: thanks, i'm starting https://gerrit.wikimedia.org/r/#/c/206322/ [23:35:50] CUSTOM - Host analytics1034 is UPING OK - Packet loss = 0%, RTA = 2.26 ms [23:35:59] what the .. [23:36:48] jgage: YuviPanda: m = m.replace(': -', ':') # Combine separators [23:36:53] ^ looks very much like it [23:37:00] no [23:37:51] my money's on this one [23:37:52] m = re.sub(r'(\w+): \1:?', r'\1', m) [23:37:57] the parenthesis moved [23:38:05] also i don't understand it ;) [23:39:50] It turns "FOO: FOO:" into "FOO" [23:40:02] Speaking of which, The Parenthesis Moved would be a good title for a techno-horror movie. 
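[Editor's note: the "UPING" mystery traced above comes from the ircecho deduplication regex `re.sub(r'(\w+): \1:?', r'\1', m)` matching across word boundaries — exactly the "it matches P: P" guess: the group backtracks to the trailing "P" of "UP" and the leading "P" of "PING". The sketch below reproduces the bug; the word-boundary fix shown is a suggestion only, not the patch that was actually merged or reverted.]

```python
import re

# The deduplication pattern quoted from ircecho: collapse "FOO: FOO:" to "FOO".
dedup = re.compile(r'(\w+): \1:?')

msg = "RECOVERY - Host labvirt1002 is UP: PING OK - Packet loss = 0%"

# Bug: the group can match a partial word. Here it backtracks to "P",
# finds "P: P" spanning "UP: PING", and collapses it -- producing "UPING".
print(dedup.sub(r'\1', msg))
# -> RECOVERY - Host labvirt1002 is UPING OK - Packet loss = 0%

# One possible fix (an assumption, not the deployed change): anchor the
# repeated word at word boundaries so partial-word matches are impossible.
dedup_fixed = re.compile(r'\b(\w+): \1\b:?')

# "UP: PING" is now left alone...
print(dedup_fixed.sub(r'\1', msg))
# -> RECOVERY - Host labvirt1002 is UP: PING OK - Packet loss = 0%

# ...while genuine duplicates are still collapsed.
print(dedup_fixed.sub(r'\1', "host is UP: UP: reachable"))
# -> host is UP reachable
```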
[23:40:16] mutante jgage well, I think it should’ve been fixed instead of reverted [23:40:17] but oh well [23:40:21] * YuviPanda doesn’t care enough [23:40:24] i've never seen \1 in the search part before, tricksy. [23:40:47] jgage: it’s just a backreference [23:41:09] yeah, just never seen it in the search rather than replace [23:41:21] oh [23:41:26] !log reverted labvirt1001 to 3.13.0-49-generic because 3.16 wouldn’t mount the fs [23:41:26] it’s used for matching repetitions I think [23:41:30] it would have horrible performance [23:41:30] Logged the message, Master [23:41:33] !log updating labvirt1002 to 3.13.0-49-generic, dist-upgrade, rebooting [23:41:36] Logged the message, Master [23:41:39] jouncebot: next [23:41:39] heh bd808 [23:41:42] no? [23:41:43] well [23:41:45] bd808: no idea then :| [23:42:09] YuviPanda: :( I might poke at it after dinner [23:42:13] bd808: ok [23:42:27] grrr [23:42:29] jouncebot_: refresh [23:42:30] I refreshed my knowledge about deployments. [23:42:36] jouncebot_: next [23:42:40] well [23:42:40] ok [23:42:53] grrrit-wm is also dead [23:42:55] * YuviPanda restarts that [23:43:19] so many bots [23:43:23] we need a bot to restart dead bots [23:43:26] PROBLEM - salt-minion processes on labvirt1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:43:40] imho if something breaks the output it can be reverted without having to have the fix [23:43:42] jgage: actually...
not the worst idea I've ever heard [23:43:49] :D [23:44:00] (03PS1) 10Dzahn: Revert "Revert "Make ircecho deduplicate statuses on all lines in buffer"" [puppet] - 10https://gerrit.wikimedia.org/r/206333 [23:44:07] http://learnyousomeerlang.com/supervisors [23:44:10] supervisor trees [23:44:38] got you covered [23:44:56] RECOVERY - salt-minion processes on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:45:23] I was thinking a bot in -labs that you could tell to restart a dead tool [23:45:41] But that presupposes that we know how to start random tool X [23:45:57] that would be nifty, but i think the bots are sufficiently important that they should be made more resilient instead [23:46:10] !log mattflaschen Synchronized wmf-config/CommonSettings.php: Bump Flow cache version to 4.7 (1e28cf78e64eb860d6eade775abae43d11c1dd75) (duration: 00m 16s) [23:46:14] Logged the message, Master [23:46:27] ori: agreed actually [23:46:38] (03CR) 10Dzahn: [C: 032] "everybody cares just enough not to" [puppet] - 10https://gerrit.wikimedia.org/r/206333 (owner: 10Dzahn) [23:46:42] * bd808 tried to found a tools team last week [23:46:59] yes, but the supervision tree wasn't clear ;) [23:47:06] PROBLEM - Host labvirt1002 is DOWN: PING CRITICAL - Packet loss = 100% [23:47:09] heh [23:47:31] and the WorkerFactoryFactory was not found [23:48:30] mutante: i need a twitter stream of your review comments [23:49:49] this one was great, i also liked: "the comments are a bit confusing. it's like you guys agree on _not_ merging this but also there is a +1" [23:50:15] RECOVERY - Host labvirt1002 is UPING OK - Packet loss = 0%, RTA = 2.01 ms [23:51:11] btw uping is a verb [23:51:44] mutante: https://gerrit.wikimedia.org/r/#/c/201924/ too :) [23:52:07] Okay, rmoen put the Flow patch on testwiki, so we're going to be doing a test LQT->Flow conversion there (we were going to start on testwiki anyway).
[23:52:11] ori: it will all be lost because "import gerrit comment history in diffusion" will be rejected :/ [23:52:51] rmoen, you can go ahead and do the sync-dir now, though. It should only affect import code, tests, and maintenance scripts. [23:53:07] ok [23:54:36] mattflaschen: how much fun is it to say Bump Flow ? [23:54:42] !log rmoen Synchronized php-1.26wmf3/extensions/Flow: Bump flow for cherry-pick (duration: 00m 23s) [23:54:43] awesome [23:54:48] Logged the message, Master [23:55:10] rmoen, probably not as much fun as calling mobile web pair programming Gather Round. [23:55:39] mattflaschen: :) [23:56:02] (03PS13) 10BBlack: Adding a Last-Access cookie to text and mobile requests [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T88813) (owner: 10Nuria) [23:57:24] mattflaschen: marked swats as done [23:57:33] I ^ [23:58:40] rmoen: Are you all done? [23:58:50] RoanKattouw: yep [23:58:52] Darn it, it failed in WikimediaEvents, which of course I don't have enabled locally. [23:59:06] mattflaschen :(