[00:01:50] (03PS1) 10Dereckson: Enable SandboxLink on es.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206735 [00:07:33] (03PS1) 10Dereckson: Enable ShortUrl on ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206736 (https://phabricator.wikimedia.org/T92820) [00:09:09] (03PS1) 10Dereckson: Enable ShortUrl on es.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206737 (https://phabricator.wikimedia.org/T96668) [00:46:06] (03CR) 10Dereckson: "Yes, I write them fluidly in such kind of delay." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) (owner: 10Nemo bis) [00:52:48] (03CR) 10Dereckson: [C: 031] Enable assigning 'accountcreator' for newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206093 (https://phabricator.wikimedia.org/T96824) (owner: 10Mjbmr) [01:14:09] !log tstarling Started scap: deploying SecurePoll edit count maintenance script [01:19:54] (03CR) 10Dereckson: Create Wikipedia Konkani (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [01:24:33] !log creating bv2015_edits tables on all wikis [01:31:00] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 2 below the confidence bounds [01:43:57] !log on terbium: running extensions/SecurePoll/cli/wm-scripts/bv2015/populateEditCount.php on all wikis [02:12:22] !log tstarling Finished scap: deploying SecurePoll edit count maintenance script (duration: 58m 13s) [02:23:30] !log l10nupdate Synchronized php-1.26wmf2/cache/l10n: (no message) (duration: 09m 19s) [02:24:50] PROBLEM - RAID on snapshot1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:26:30] RECOVERY - RAID on snapshot1004 is OK no RAID installed [02:27:21] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [02:27:39] We have a problem with morebots [02:28:16] !log LocalisationUpdate completed (1.26wmf2) at 2015-04-27 02:27:13+00:00 [02:28:24] robh: ^ do you think you could kick it [02:33:40] he's likely not about [02:47:50] !log l10nupdate Synchronized php-1.26wmf3/cache/l10n: (no message) (duration: 09m 38s) [02:52:12] !log LocalisationUpdate completed (1.26wmf3) at 2015-04-27 02:51:08+00:00 [04:48:07] (03PS1) 10BBlack: Revert "Depool esams, planned upsteam network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/206752 [04:48:16] (03CR) 10BBlack: [C: 032 V: 032] Revert "Depool esams, planned upsteam network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/206752 (owner: 10BBlack) [04:49:03] !log repooling esams for prod traffic (maint over) [05:05:05] morebots, back? [05:05:05] I am a logbot running on tools-exec-06. [05:05:05] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [05:05:05] To log a message, type !log . 
[05:09:26] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Apr 27 05:08:23 UTC 2015 (duration 8m 22s) [05:09:34] Logged the message, Master [06:21:23] (03CR) 10Florianschmidtwelzow: Set $wgRateLimits['badcaptcha'] to counter bots (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) (owner: 10Nemo bis) [06:29:51] PROBLEM - puppet last run on mw2090 is CRITICAL puppet fail [06:30:31] PROBLEM - puppet last run on mw1177 is CRITICAL puppet fail [06:31:01] PROBLEM - puppet last run on lvs2004 is CRITICAL Puppet has 1 failures [06:31:10] PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 1 failures [06:31:11] PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 1 failures [06:31:21] PROBLEM - puppet last run on db1043 is CRITICAL puppet fail [06:34:01] PROBLEM - puppet last run on mw2066 is CRITICAL Puppet has 1 failures [06:34:01] PROBLEM - puppet last run on mw2023 is CRITICAL Puppet has 1 failures [06:34:51] PROBLEM - puppet last run on db1026 is CRITICAL Puppet has 1 failures [06:35:00] PROBLEM - puppet last run on ms-fe2003 is CRITICAL Puppet has 1 failures [06:35:12] PROBLEM - puppet last run on mw1119 is CRITICAL Puppet has 1 failures [06:35:20] PROBLEM - puppet last run on lvs2001 is CRITICAL Puppet has 1 failures [06:36:01] PROBLEM - puppet last run on labvirt1005 is CRITICAL Puppet has 1 failures [06:36:11] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures [06:36:11] PROBLEM - puppet last run on mw1052 is CRITICAL Puppet has 1 failures [06:36:20] PROBLEM - puppet last run on mw1144 is CRITICAL Puppet has 2 failures [06:36:21] PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 1 failures [06:36:22] PROBLEM - puppet last run on mw2206 is CRITICAL Puppet has 1 failures [06:36:30] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 1 failures [06:36:30] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures 
[06:36:31] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures [06:44:31] PROBLEM - puppet last run on cp3013 is CRITICAL puppet fail [06:45:41] RECOVERY - puppet last run on mw2066 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:45:41] RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:46:01] RECOVERY - puppet last run on lvs2004 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:46:11] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:46:20] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [06:46:20] RECOVERY - puppet last run on mw1052 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:46:21] RECOVERY - puppet last run on mw1144 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:46:22] RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:46:31] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:46:50] RECOVERY - puppet last run on mw1119 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:47:02] RECOVERY - puppet last run on lvs2001 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:47:10] RECOVERY - puppet last run on mw1177 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:47:41] RECOVERY - puppet last run on labvirt1005 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:47:50] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:48:00] RECOVERY - puppet last run on db1043 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures 
[06:48:10] RECOVERY - puppet last run on mw2206 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:11] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:11] RECOVERY - puppet last run on mw2090 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:48:11] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:48:20] RECOVERY - puppet last run on ms-fe2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:51] RECOVERY - puppet last run on db1026 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:02:50] RECOVERY - puppet last run on cp3013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:17:04] (03CR) 10Qgil: Phab monthly stats email: Clarify what day values for priority mean (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/206515 (owner: 10Aklapper) [07:19:13] (03CR) 10Qgil: [C: 031] "Good idea, but I haven't tested the MySQL instruction." [puppet] - 10https://gerrit.wikimedia.org/r/206518 (owner: 10Aklapper) [07:27:50] 6operations, 6Commons, 6Multimedia, 7HHVM, 5Patch-For-Review: Create an HHVM 3.6.0 package, adding Tim's streaming patch - https://phabricator.wikimedia.org/T93194#1237513 (10Joe) the package is created and running since friday in beta on one host. If I don't see any really worrying sign of problems, I'... [07:39:22] 6operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1237525 (10Visdaviva) @Glaisher done uploaded a newer version of the logo. Let me know if there is anything else that I could help with. 
[08:00:11] 6operations, 6Commons, 6Multimedia, 7HHVM, 5Patch-For-Review: Create an HHVM 3.6.0 package, adding Tim's streaming patch - https://phabricator.wikimedia.org/T93194#1237573 (10Joe) In deployment-mediawiki01 logs I find a few occurrences of core dumps, and inspecting those I find ``` message: Attempted to... [08:19:40] !log ms-be101[678] object weight to 3000 [08:19:46] Logged the message, Master [08:51:09] what's the policy on self +2ing config changes that affect labs? context: https://gerrit.wikimedia.org/r/#/c/206375/ [08:54:25] phuedx, just do it and pull on tin [08:54:43] hey MaxSem [08:54:45] ta [09:13:31] MaxSem: does the config get pushed to labs automatically then? [09:26:52] < still a noob [09:33:47] (03PS3) 10Phuedx: Enable Browse experiment on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206375 (https://phabricator.wikimedia.org/T94739) [09:39:08] 6operations, 10MediaWiki-Debug-Logging, 6Release-Engineering, 6Security-Team, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1237716 (10fgiunchedi) ok even with unsampled xff fluorine grows at ~12G/day with ~800G free, if we're short on space again we can e... [09:39:38] 6operations, 6Mobile-Apps, 6Services: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1237718 (10akosiaris) While the above are being solved out, there is some information that needs to be provided for this to be deployed in production. While the process for p... 
[09:41:08] (03PS2) 10Alexandros Kosiaris: Assign LVS IPs to the graphoid service [dns] - 10https://gerrit.wikimedia.org/r/205856 (https://phabricator.wikimedia.org/T90487) [09:45:03] (03PS4) 10Phuedx: Enable Browse experiment on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206375 (https://phabricator.wikimedia.org/T94739) [09:45:06] 6operations: Define and implement an automated process to ease the introduction of a new service into production - https://phabricator.wikimedia.org/T97036#1237723 (10akosiaris) [09:45:08] 6operations, 6Services: Define and then implement a way for a future service owner to provide the info required to have a new service brought into production - https://phabricator.wikimedia.org/T97031#1237722 (10akosiaris) [09:45:12] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1237721 (10akosiaris) [10:26:36] 6operations, 10ops-eqiad: reclaim tungsten as spare - https://phabricator.wikimedia.org/T97274#1237764 (10fgiunchedi) 3NEW [10:26:55] 6operations, 7Graphite, 5Patch-For-Review: backfill metrics from tungsten to graphite1001 - https://phabricator.wikimedia.org/T90591#1237773 (10fgiunchedi) 5Open>3Resolved I double checked metrics left on tungsten, the initial sync to new hardware migrated all relevant metrics so no need to backfill old... 
[10:26:56] 6operations, 7Graphite: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#1237777 (10fgiunchedi) [11:23:00] (03PS1) 10Dereckson: WIP: prevent new wikis to use Graph: namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206776 [11:25:44] (03PS1) 10Dereckson: Enable Graph extension on sv.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206777 (https://phabricator.wikimedia.org/T97027) [11:29:19] (03PS1) 10Filippo Giunchedi: statsite: enable extended counters by default [puppet] - 10https://gerrit.wikimedia.org/r/206781 (https://phabricator.wikimedia.org/T95703) [11:30:22] 6operations, 7Graphite, 5Patch-For-Review: Counters now only provide rates (multiplied by 1000?) - https://phabricator.wikimedia.org/T95703#1237873 (10fgiunchedi) I'll be enabling extended counters tomorrow with https://gerrit.wikimedia.org/r/206781 and will be renaming the metrics on the graphite/puppet sid... [11:32:19] (03CR) 10Yurik: WIP: prevent new wikis to use Graph: namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206776 (owner: 10Dereckson) [11:35:35] 6operations, 10Traffic, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1237879 (10faidon) To elaborate: I'm worried that of the repercussions this would have for non-SPDY clients, as they wouldn't be able to use a different set of connections to fetch those res... 
[11:36:10] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0] [11:36:25] (03CR) 10Dereckson: WIP: prevent new wikis to use Graph: namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206776 (owner: 10Dereckson) [11:42:20] (03PS1) 10Filippo Giunchedi: gdash: improve graphite network dashboard [puppet] - 10https://gerrit.wikimedia.org/r/206783 [11:42:42] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] gdash: improve graphite network dashboard [puppet] - 10https://gerrit.wikimedia.org/r/206783 (owner: 10Filippo Giunchedi) [11:45:24] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Two inline comments. Otherwise LGTM and pretty much ready for merge" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/206105 (https://phabricator.wikimedia.org/T90487) (owner: 10Mobrovac) [11:51:21] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [11:51:41] 6operations, 5Patch-For-Review: adjust CirrusSearch monitoring - https://phabricator.wikimedia.org/T84163#1237902 (10fgiunchedi) yeah that's true, I think the confusion comes from the fact that notifications are not sent but the alarm still shows when looking for critical/warning/unknown alerts in icinga's def... [11:51:52] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor inline issue, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/206106 (https://phabricator.wikimedia.org/T90487) (owner: 10Mobrovac) [11:54:46] (03CR) 10Alexandros Kosiaris: [C: 031 V: 032] "LGTM, will merge after https://gerrit.wikimedia.org/r/#/c/206105/3 and https://gerrit.wikimedia.org/r/#/c/206106/1 are merged." 
[puppet] - 10https://gerrit.wikimedia.org/r/206108 (https://phabricator.wikimedia.org/T90487) (owner: 10Mobrovac) [12:05:47] (03PS1) 10Dereckson: Added delwedd.llgc.org.uk to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206786 (https://phabricator.wikimedia.org/T97281) [12:07:06] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 22 hours old. [12:08:47] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old. [12:10:50] (03CR) 10Dereckson: [C: 04-1] "Hold on. Exact domain is still discussed in the task." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206786 (https://phabricator.wikimedia.org/T97281) (owner: 10Dereckson) [12:30:19] (03PS8) 10Merlijn van Deen: Extend Exim diamond collector for Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/206118 (https://phabricator.wikimedia.org/T96898) [12:31:53] !log upgrading pfw-eqiad to newer junos [12:32:02] Logged the message, Master [12:39:12] (03PS14) 10BBlack: Adding a Last-Access cookie to text and mobile requests [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T88813) (owner: 10Nuria) [12:48:37] (03PS2) 10Dereckson: Added delwedd.llgc.org.uk to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206786 (https://phabricator.wikimedia.org/T97281) [12:49:04] (03CR) 10Merlijn van Deen: [C: 04-1] "Could we place it in init.pp to remove the code duplication?" 
[puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160) (owner: 10Tim Landscheidt) [12:51:10] (03CR) 10BBlack: "PS14 wraps the recv bit in req.restarts==0 (common pattern to avoid extra work on request restart) - retested on betalabs and still works " [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T88813) (owner: 10Nuria) [12:51:58] (03CR) 10BBlack: [C: 032] Adding a Last-Access cookie to text and mobile requests [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T88813) (owner: 10Nuria) [12:52:32] PROBLEM - puppet last run on mw1149 is CRITICAL Puppet has 1 failures [12:55:12] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 22 hours old. [12:56:48] uhoh, that's not good [12:56:53] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old. [12:57:01] heh [12:57:16] (03PS3) 10Dereckson: Added *.llgc.org.uk to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206786 (https://phabricator.wikimedia.org/T97281) [12:57:48] so we're in sync for official-jessie? 
[12:58:01] sure [12:58:08] should be :) [12:59:21] (03CR) 10Mobrovac: Graphoid: service deployment on SCA (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/206105 (https://phabricator.wikimedia.org/T90487) (owner: 10Mobrovac) [12:59:28] it's a little nerve-wracking how often systemd packages have updated over the past month or two (including one somewhere between "last week" and "official release yesterday") [12:59:34] (03PS4) 10Mobrovac: Graphoid: service deployment on SCA [puppet] - 10https://gerrit.wikimedia.org/r/206105 (https://phabricator.wikimedia.org/T90487) [13:00:17] (03CR) 10jenkins-bot: [V: 04-1] Graphoid: service deployment on SCA [puppet] - 10https://gerrit.wikimedia.org/r/206105 (https://phabricator.wikimedia.org/T90487) (owner: 10Mobrovac) [13:00:24] damn [13:01:15] (03CR) 10Merlijn van Deen: [C: 031] "/data/project/.system/store/hostkey-toolsbeta-puppetmaster3.eqiad.wmflabs 2015-04-08 14:43:16.401294339 +0000" [puppet] - 10https://gerrit.wikimedia.org/r/196125 (https://phabricator.wikimedia.org/T92379) (owner: 10Tim Landscheidt) [13:01:22] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL host 208.80.154.197, interfaces up: 214, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/3: down - pfw2-eqiad:xe-6/0/0 {#2954} [10Gbps DF]BR [13:02:06] (03PS5) 10Mobrovac: Graphoid: service deployment on SCA [puppet] - 10https://gerrit.wikimedia.org/r/206105 (https://phabricator.wikimedia.org/T90487) [13:02:28] paravoid: yeah I've seen a few emails from ftpsync over the weekend too, jessie release related perhaps [13:02:42] bblack: it's cherry-picked fixes, basically [13:02:57] I've seen a few of them, they're all very reasonable [13:03:02] RECOVERY - Router interfaces on cr2-eqiad is OK host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0 [13:03:48] ok [13:07:03] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/3: down - 
Core: pfw1-eqiad:xe-6/0/0 {#2952} [10Gbps DF]BR [13:07:09] (03CR) 10Mobrovac: Graphoid: LVS configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/206106 (https://phabricator.wikimedia.org/T90487) (owner: 10Mobrovac) [13:07:18] (03PS2) 10Mobrovac: Graphoid: LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/206106 (https://phabricator.wikimedia.org/T90487) [13:08:43] RECOVERY - Router interfaces on cr1-eqiad is OK host 208.80.154.196, interfaces up: 230, down: 0, dormant: 0, excluded: 0, unused: 0 [13:09:35] (03PS2) 10Mobrovac: Graphoid: Varnish configuration [puppet] - 10https://gerrit.wikimedia.org/r/206108 (https://phabricator.wikimedia.org/T90487) [13:09:43] RECOVERY - puppet last run on mw1149 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:14:22] PROBLEM - DPKG on cp3005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:19:02] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 23 hours old. [13:20:53] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old. [13:28:41] !log upgrading pfw-codfw to newer junos [13:28:44] Logged the message, Master [13:32:40] (03PS2) 10Dereckson: Prevent new wikis to use Graph: namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206776 [13:34:37] (03CR) 10Dereckson: "PS2: per Yurik comment, don't add anything in the configuration if no namespace is needed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/206776 (owner: 10Dereckson) [13:34:44] (03PS1) 10Filippo Giunchedi: Revert "eventlogging: adjust counters thresholds" [puppet] - 10https://gerrit.wikimedia.org/r/206797 (https://phabricator.wikimedia.org/T95703) [13:35:27] (03PS2) 10Dereckson: Enable Graph extension on sv.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206777 (https://phabricator.wikimedia.org/T97027) [13:35:40] (03CR) 10Filippo Giunchedi: [C: 04-1] "to be merged tomorrow after I877e9914a7" [puppet] - 10https://gerrit.wikimedia.org/r/206797 (https://phabricator.wikimedia.org/T95703) (owner: 10Filippo Giunchedi) [13:36:16] (03CR) 10Filippo Giunchedi: [C: 04-1] "to be merged tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/206781 (https://phabricator.wikimedia.org/T95703) (owner: 10Filippo Giunchedi) [13:40:13] PROBLEM - Router interfaces on cr1-codfw is CRITICAL host 208.80.153.192, interfaces up: 114, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/3: down - Core: pfw-codfw:xe-6/0/0 {#10900} [10Gbps DF]BR [13:42:02] RECOVERY - Router interfaces on cr1-codfw is OK host 208.80.153.192, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0 [13:46:23] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [13:51:12] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193 [13:57:43] PROBLEM - puppet last run on cp3021 is CRITICAL puppet fail [13:57:53] RECOVERY - BGP status on cr2-ulsfo is OK host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0 [14:01:43] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [14:01:52] RECOVERY - DPKG on cp3005 is OK: All packages OK [14:02:22] PROBLEM - Router interfaces on cr2-codfw is CRITICAL host 208.80.153.193, interfaces up: 110, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/3: down - Core: pfw-codfw:xe-15/0/0 {#10901} [10Gbps DF]BR [14:03:10] 
(03PS1) 10KartikMistry: Added initial Debian package for apertium-en-gl [debs/contenttranslation/apertium-en-gl] - 10https://gerrit.wikimedia.org/r/206803 (https://phabricator.wikimedia.org/T96654) [14:04:10] !log puppet disabled on caches while apt upgrades run... [14:04:14] Logged the message, Master [14:07:22] PROBLEM - puppet last run on analytics1011 is CRITICAL Puppet last ran 2 days ago [14:07:23] PROBLEM - puppet last run on analytics1013 is CRITICAL Puppet last ran 2 days ago [14:07:46] (just reenabled puppet there, forgot I left it off on friday :/) [14:07:52] PROBLEM - puppet last run on analytics1041 is CRITICAL Puppet last ran 2 days ago [14:10:42] RECOVERY - puppet last run on analytics1013 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [14:14:02] RECOVERY - puppet last run on analytics1011 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:14:47] 6operations, 6Mobile-Apps, 6Services: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1238116 (10mobrovac) [14:17:10] (03PS1) 10KartikMistry: Added initial Debian packaging for apertium-es-gl [debs/contenttranslation/apertium-es-gl] - 10https://gerrit.wikimedia.org/r/206805 (https://phabricator.wikimedia.org/T96654) [14:22:33] PROBLEM - Router interfaces on cr1-codfw is CRITICAL host 208.80.153.192, interfaces up: 114, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/3: down - Core: pfw-codfw:xe-6/0/0 {#10900} [10Gbps DF]BR [14:24:13] RECOVERY - Router interfaces on cr1-codfw is OK host 208.80.153.192, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0 [14:24:13] RECOVERY - Router interfaces on cr2-codfw is OK host 208.80.153.193, interfaces up: 112, down: 0, dormant: 0, excluded: 0, unused: 0 [14:24:34] RECOVERY - puppet last run on analytics1041 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [14:24:42] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 
is CRITICAL 11.11% of data above the critical threshold [20000.0] [14:26:57] (03PS1) 10KartikMistry: Added initial Debian package for apertium-pt-gl [debs/contenttranslation/apertium-pt-gl] - 10https://gerrit.wikimedia.org/r/206806 (https://phabricator.wikimedia.org/T96654) [14:29:43] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [14:30:43] (03PS1) 10Merlijn van Deen: Puppet cron: Mail last 50 lines of log on error [puppet] - 10https://gerrit.wikimedia.org/r/206807 (https://phabricator.wikimedia.org/T96122) [14:35:08] (03PS4) 10RobH: mholloway granted access as releaser-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/205917 (https://phabricator.wikimedia.org/T96886) [14:36:34] (03CR) 10RobH: [C: 032] mholloway granted access as releaser-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/205917 (https://phabricator.wikimedia.org/T96886) (owner: 10RobH) [14:38:19] 6operations, 10Traffic, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1238147 (10BBlack) I think @ori 's contention last we discussed this on IRC was that the current count of $domain + bits connections in a typical page load (combined) is under the threshold... [14:38:29] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to caesium for Michael Holloway - https://phabricator.wikimedia.org/T96886#1238148 (10RobH) 5Open>3Resolved This is now merged live. @Mholloway, if you have any issues, let me know. The change will take up to 30 minutes to propaga... [14:41:04] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I am not very fond of this. I 'd prefer a >/dev/null 2>&1." 
[puppet] - 10https://gerrit.wikimedia.org/r/206807 (https://phabricator.wikimedia.org/T96122) (owner: 10Merlijn van Deen) [14:41:39] 10Ops-Access-Requests, 6operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1238151 (10RobH) If this has shifted entirely from an access request, can one of you guys better rename the task and remove the ops-access-requests project? (I'd rather just not pull the... [14:42:00] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1238152 (10GWicke) [14:43:28] 6operations, 7Wikimedia-log-errors: internal_api_error_Exception: [22e05a83] Exception Caught: wfDiff(): popen() failed errors on English Wikipedia - https://phabricator.wikimedia.org/T97145#1238159 (10Aklapper) @greg: Probably fixed as per anomie's last comment but still something to keep an eye on. [14:43:47] (03CR) 10Merlijn van Deen: "Fair enough. The >/dev/null option has the issue that no mails will be sent out at all, of course. I suppose that's OK given that the moni" [puppet] - 10https://gerrit.wikimedia.org/r/206807 (https://phabricator.wikimedia.org/T96122) (owner: 10Merlijn van Deen) [14:47:18] (03PS2) 10Merlijn van Deen: Puppet cron: Silence e-mails [puppet] - 10https://gerrit.wikimedia.org/r/206807 (https://phabricator.wikimedia.org/T96122) [14:47:47] (03CR) 10Alexandros Kosiaris: "Exactly. 
We anyway have not relied on that specific cron generated email for a long time now to catch these errors but rather on icinga ch" [puppet] - 10https://gerrit.wikimedia.org/r/206807 (https://phabricator.wikimedia.org/T96122) (owner: 10Merlijn van Deen) [14:48:09] (03CR) 10Alexandros Kosiaris: [C: 032] Puppet cron: Silence e-mails [puppet] - 10https://gerrit.wikimedia.org/r/206807 (https://phabricator.wikimedia.org/T96122) (owner: 10Merlijn van Deen) [14:50:25] !log upgrade statsite on graphite1001 [14:50:30] Logged the message, Master [14:51:28] ottomata1: your "puppet agent -t" salt regex is bad, it hit everything everywhere I think? [14:52:06] (because the alternation starts out with a leading |, meaning or-anything) [14:52:42] 6operations, 10Analytics-Cluster: Turn off webrequest udp2log instances. - https://phabricator.wikimedia.org/T97294#1238209 (10Ottomata) 3NEW a:3Ottomata [14:53:03] PROBLEM - configured eth on ganeti1001 is CRITICAL: tap0 reporting no carrier. [14:53:31] 6operations, 10Analytics-Cluster: Turn off webrequest udp2log instances. - https://phabricator.wikimedia.org/T97294#1238218 (10Ottomata) [14:53:34] 6operations, 10Analytics-Cluster: Set up ops kafkatee instance as part of udp2log transition - https://phabricator.wikimedia.org/T96616#1238217 (10Ottomata) [14:53:56] 6operations, 10Wikimedia-SVG-rendering: Install PT (paratype) font on image scalars - https://phabricator.wikimedia.org/T97181#1238228 (10Aklapper) p:5Triage>3Low [14:56:11] (03PS1) 10Ottomata: Turn off oxygen's udp2log instance, move 5xx error log to erbium, turn off some erbium filters [puppet] - 10https://gerrit.wikimedia.org/r/206810 (https://phabricator.wikimedia.org/T97294) [14:58:58] bblack [14:59:12] eh? 
yes, i think i remember accidentally running that and ctrl-cing it as soon as I saw that [14:59:15] (03PS2) 10Ottomata: Turn off oxygen's udp2log instance, move 5xx error log to erbium, turn off some erbium filters [puppet] - 10https://gerrit.wikimedia.org/r/206810 (https://phabricator.wikimedia.org/T97294) [14:59:16] that was last week though, right? [14:59:18] it's still running [14:59:21] ! [14:59:22] no [14:59:38] i ran the command last week [14:59:46] how do I see it? [15:00:04] Hi. [15:00:04] manybubbles, anomie, ^d, thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150427T1500). [15:00:47] via salt-run [15:00:53] the jid is 20150424185036329800 so I guess it was last week [15:01:27] primarily I just didn't want you to copypasta that regex and use it again, because there might be things you'd do to analytics* that would end very badly if they hit all prod nodes :P [15:02:06] ok, still not sure how to use salt-run to see this, can I kill it? [15:02:12] Dereckson: looks like you're it this morning. Also, looks like I may be it this morning as far as deployers go :P [15:02:21] there are docs :) [15:02:28] i am reading man! [15:02:31] 'morning thcipriani [15:02:42] ah -d? [15:02:56] Dereckson: g'morning, any particular order for these patches? [15:03:08] * thcipriani looking through now [15:03:38] got it... [15:04:26] No, except that 199321 depends on 206650 [15:04:33] PROBLEM - DPKG on cp1065 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:04:33] PROBLEM - DPKG on cp1063 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:04:52] PROBLEM - DPKG on cp1066 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:04:53] PROBLEM - DPKG on cp1064 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:05:05] bblack, did you kill it? 
i can't find that jid [15:05:23] nope [15:05:32] PROBLEM - DPKG on cp1068 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:05:47] also, that left a stuck puppet (from ctrl-c? donno) on lvs4001, which has disable agent there all weekend [15:06:02] PROBLEM - DPKG on cp1067 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:06:36] weird. [15:06:45] (03CR) 10Thcipriani: [C: 032] Set $wgEnotifMinorEdits to true on wikimania2016.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206648 (https://phabricator.wikimedia.org/T96564) (owner: 10Dereckson) [15:06:50] (03Merged) 10jenkins-bot: Set $wgEnotifMinorEdits to true on wikimania2016.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206648 (https://phabricator.wikimedia.org/T96564) (owner: 10Dereckson) [15:07:06] bblack do you still see that job? i'm doing [15:07:10] salt-run jobs.list_jobs [15:07:21] the only puppet one i see is a --disable from you [15:07:25] '20150427140351651475': [15:07:25] Arguments: [15:07:26] - puppet agent --disable [15:07:40] did you use batching? 
[15:08:02] in any case, I killed the child proc on lvs4001, that may have ended it [15:08:13] hm, no bblack, i just did cmd.run [15:08:17] ok [15:08:17] thanks [15:08:22] RECOVERY - puppet last run on lvs4001 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:08:32] (03PS3) 10Ottomata: Turn off oxygen's udp2log instance, move 5xx error log to erbium, turn off some erbium filters [puppet] - 10https://gerrit.wikimedia.org/r/206810 (https://phabricator.wikimedia.org/T97294) [15:08:53] I think list_jobs isn't what you want to look at anyways, you want jobs.active [15:09:37] then again, all things salt-related seem to be non-deterministic in the first place :P [15:09:48] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT [[gerrit:206648]] (duration: 00m 51s) [15:09:52] Logged the message, Master [15:10:03] wow there's a really old tail one from me too, that was probably ctrl-ced [15:10:14] ^ Dereckson , please check 206648 [15:11:45] (03CR) 10Thcipriani: [C: 032] Removed legacy groups on fr.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206647 (https://phabricator.wikimedia.org/T90979) (owner: 10Dereckson) [15:11:50] (03Merged) 10jenkins-bot: Removed legacy groups on fr.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206647 (https://phabricator.wikimedia.org/T90979) (owner: 10Dereckson) [15:15:19] (03PS1) 10Filippo Giunchedi: gdash: expand graphite dashboard colors [puppet] - 10https://gerrit.wikimedia.org/r/206813 [15:15:50] (03CR) 10Ottomata: [C: 032] Turn off oxygen's udp2log instance, move 5xx error log to erbium, turn off some erbium filters [puppet] - 10https://gerrit.wikimedia.org/r/206810 (https://phabricator.wikimedia.org/T97294) (owner: 10Ottomata) [15:16:35] Dereckson: everything ok? Paused on the other patches. [15:17:15] phew, ok, no active jobs now, thanks bblack [15:17:23] I guess it's okay. 
I've pinged nemo to test that change with him, as we need two contributors for that, but he doesn't seem to be here. [15:17:41] (03PS2) 10Filippo Giunchedi: gdash: expand graphite dashboard colors [puppet] - 10https://gerrit.wikimedia.org/r/206813 [15:18:06] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] gdash: expand graphite dashboard colors [puppet] - 10https://gerrit.wikimedia.org/r/206813 (owner: 10Filippo Giunchedi) [15:18:08] But, yes, it's okay. In user preferences, that looks good. [15:18:11] Dereckson: kk, going ahead with the others. [15:19:05] Are there any cases where a blind addition of an @ like this will do something unintended? https://gerrit.wikimedia.org/r/#/c/206720/ [15:20:01] !log thcipriani Synchronized wmf-config/flaggedrevs.php: SWAT [[gerrit:206647]] (duration: 00m 14s) [15:20:05] Logged the message, Master [15:20:16] Testing. [15:20:30] 206647 looks good. [15:22:33] (03CR) 10Thcipriani: [C: 032] Add draft namespace for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204467 (https://phabricator.wikimedia.org/T92760) (owner: 10Mjbmr) [15:22:46] greg-g: Hi... I would like to deploy Wikibase changes later today (didn't make it until SWAT because we no one seems to be around for review today :S) [15:23:59] 6operations, 6Services, 7Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1238297 (10mobrovac) ``` swagger: 2.0 info: version: title: x-default-params: title: Foobar paths: /page: get: tags: [a, b, c] x-monitor: false /page/h... [15:24:30] (03Merged) 10jenkins-bot: Add draft namespace for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204467 (https://phabricator.wikimedia.org/T92760) (owner: 10Mjbmr) [15:27:18] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT [[gerrit:204467]] (duration: 00m 29s) [15:27:24] Logged the message, Master [15:27:46] Testing. 
[15:28:06] (03CR) 10Alexandros Kosiaris: [C: 032] Graphoid: service deployment on SCA [puppet] - 10https://gerrit.wikimedia.org/r/206105 (https://phabricator.wikimedia.org/T90487) (owner: 10Mobrovac) [15:29:06] (03PS1) 10Ottomata: Remove oxygen udp2log related code [puppet] - 10https://gerrit.wikimedia.org/r/206815 (https://phabricator.wikimedia.org/T97294) [15:29:12] (03CR) 10jenkins-bot: [V: 04-1] Remove oxygen udp2log related code [puppet] - 10https://gerrit.wikimedia.org/r/206815 (https://phabricator.wikimedia.org/T97294) (owner: 10Ottomata) [15:29:22] 204467 tested, looks good to me. [15:29:30] kk [15:30:23] (03CR) 10Thcipriani: [C: 032] Added *.llgc.org.uk to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206786 (https://phabricator.wikimedia.org/T97281) (owner: 10Dereckson) [15:31:01] 10Ops-Access-Requests, 6operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1238317 (10bd808) >>! In T89222#1235884, @GWicke wrote: > Also adding @bd808, as he might have an idea for shipping plain log files to logstash. Ideally we would find a way to configu... [15:31:42] (03PS2) 10Filippo Giunchedi: graphite: stop system carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/206127 [15:33:03] (03PS2) 10Ottomata: Remove oxygen udp2log related code [puppet] - 10https://gerrit.wikimedia.org/r/206815 (https://phabricator.wikimedia.org/T97294) [15:35:00] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 1 failures [15:35:30] 10Ops-Access-Requests, 6operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1238326 (10GWicke) @RobH, the argument @faidon and @akosiaris brought up was that access for log reading purposes will no longer be needed once we figure out some way to ship of aperti... 
[15:35:56] (03PS2) 10Aaron Schulz: Set $wgJobSerialCommitThreshold to .1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206451 [15:36:07] Dereckson: hmm zuul seems upset [15:36:14] * thcipriani looks [15:36:18] what did I do? [15:36:20] (03CR) 10Andrew Bogott: [C: 032] Use @resolver instead of resolver. [puppet] - 10https://gerrit.wikimedia.org/r/206720 (owner: 10Andrew Bogott) [15:36:26] with the two flagged rev patches? [15:36:46] oh er yes [15:36:58] 206786 depends on another change [15:37:44] which is 206727 [15:38:10] Could we deploy 206727 too? [15:38:12] ah, gotcha [15:39:17] well, if we want to get the other patch out, I suppose we'd better. [15:39:43] (03CR) 10Thcipriani: [C: 032] Added unitedarchives.noip.me to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206727 (https://phabricator.wikimedia.org/T96664) (owner: 10Dereckson) [15:39:49] (03Merged) 10jenkins-bot: Added unitedarchives.noip.me to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206727 (https://phabricator.wikimedia.org/T96664) (owner: 10Dereckson) [15:39:52] (03Merged) 10jenkins-bot: Added *.llgc.org.uk to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206786 (https://phabricator.wikimedia.org/T97281) (owner: 10Dereckson) [15:41:58] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT [[gerrit:206727]] and [[gerrit:206786]] (duration: 00m 16s) [15:42:03] Logged the message, Master [15:42:03] Testing. [15:44:27] PROBLEM - graphoid on sca1001 is CRITICAL: Connection refused [15:44:31] Meh. 
[15:44:33] (Cannot access the database: Can't connect to MySQL server on '10.64.16.27' (4) (10.64.16.27)) [15:44:46] (transient) [15:45:30] 10Ops-Access-Requests, 6operations: Create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1238361 (10RobH) Since two opsen have stated it should go to logstash, and NOT have access granted for this, it seems this is a rejected access request. It seems mean to simply close... [15:46:37] 6operations: export logs to logstash or create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1238367 (10RobH) [15:47:28] 6operations, 10Wikimedia-Logstash: Select a standard log shipping solution to use with applications that cannot be configured to send log events directly to Logstash and/or fluorine - https://phabricator.wikimedia.org/T97297#1238370 (10bd808) 3NEW [15:47:30] 206786 tested [15:48:02] 206727 tested [15:48:05] (03PS3) 10Ottomata: Remove oxygen udp2log related code [puppet] - 10https://gerrit.wikimedia.org/r/206815 (https://phabricator.wikimedia.org/T97294) [15:48:28] Dereckson: actually I don't see the preference in https://wikimania2016.wikimedia.org/wiki/Special:Preferences [15:48:35] 6operations: export logs to logstash or create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#1030183 (10bd808) I have created T97297 to track selecting a log shipper. Discussion is welcome there on the general topic. 
[15:49:08] (03CR) 10Thcipriani: [C: 032] Fixed whitespace issues in flaggedrevs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206650 (owner: 10Dereckson) [15:49:14] (03Merged) 10jenkins-bot: Fixed whitespace issues in flaggedrevs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206650 (owner: 10Dereckson) [15:51:36] !log thcipriani Synchronized wmf-config/flaggedrevs.php: SWAT [[gerrit:206650]] no-op whitespace changes (duration: 00m 22s) [15:51:42] Logged the message, Master [15:52:06] (03PS4) 10Ottomata: Remove oxygen udp2log related code [puppet] - 10https://gerrit.wikimedia.org/r/206815 (https://phabricator.wikimedia.org/T97294) [15:53:18] (03CR) 10Ottomata: [C: 032] Remove oxygen udp2log related code [puppet] - 10https://gerrit.wikimedia.org/r/206815 (https://phabricator.wikimedia.org/T97294) (owner: 10Ottomata) [15:53:26] (03CR) 10Thcipriani: [C: 032] Give patrol to reviewers for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199321 (https://phabricator.wikimedia.org/T93798) (owner: 10Cenarium) [15:53:32] (03Merged) 10jenkins-bot: Give patrol to reviewers for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199321 (https://phabricator.wikimedia.org/T93798) (owner: 10Cenarium) [15:55:30] !log thcipriani Synchronized wmf-config/flaggedrevs.php: SWAT [[gerrit:199321]] (duration: 00m 17s) [15:55:35] Logged the message, Master [15:57:12] Nemo_bis: I've the same preferences at https://pt.wikibooks.org/wiki/Especial:Prefer%C3%AAncias#mw-prefsection-echo and https://wikimania2016.wikimedia.org/wiki/Special:Preferences#mw-prefsection-echo the first having also this setting. Give me a test page you watch and I can make a minor edit on, so you can check if you receive the notification. [15:58:41] (03PS1) 10Filippo Giunchedi: statsite: emit events in statsitectl [puppet] - 10https://gerrit.wikimedia.org/r/206819 [15:59:12] 199321 tested. 
[15:59:29] Dereckson: kk, last one, incoming [16:00:06] (03CR) 10Thcipriani: [C: 032] Change project name to 'Wikipedia' at astwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201897 (https://phabricator.wikimedia.org/T94341) (owner: 10Glaisher) [16:00:52] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1238421 (10GWicke) @bblack, I haven't seen anything explicitly stating that `Retry-After` is global to a service. The only thing hinting at that is afaik `50... [16:00:57] (03Merged) 10jenkins-bot: Change project name to 'Wikipedia' at astwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201897 (https://phabricator.wikimedia.org/T94341) (owner: 10Glaisher) [16:02:53] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT [[gerrit:201897]] (duration: 00m 22s) [16:02:56] Logged the message, Master [16:03:18] Namespace works. [16:03:34] Dereckson: thanks! [16:03:35] Site title works too, after a page purge. Tested. [16:03:50] Dereckson: you can see the checkbox at https://meta.wikimedia.org/wiki/Special:Preferences [16:04:08] (03PS2) 10KartikMistry: Added initial Debian package for apertium-en-gl [debs/contenttranslation/apertium-en-gl] - 10https://gerrit.wikimedia.org/r/206803 (https://phabricator.wikimedia.org/T96654) [16:04:17] hey akosiaris, yt? [16:04:20] copper build machine q. [16:04:55] Nemo_bis: in notifications tab? [16:06:57] PROBLEM - puppet last run on ms-be2006 is CRITICAL puppet fail [16:08:13] thcipriani: by the way, we have a new extension to test in the beta cluster, Josa. greg-g has agreed for such deployment. Could we set a deployment window for Josa, at a moment Devunt is online (any weekday, between 18 and 22 UTC would seem the ideal)? [16:08:56] Dereckson: no, in the URL I gave [16:09:34] Nemo_bis: the Email me also for minor edits of pages and files? 
[16:11:21] Dereckson: for bete cluster you don't relaly need a window [16:11:27] beta, really, typing [16:12:09] so it can be deployed regardless of time? [16:12:37] greg-g: "bête (adj.) Not very bright and lacking in judgement; stupid; inept." :) [16:13:00] seems like an ok think to call the cluster [16:13:06] *thing [16:13:10] Dereckson: yes [16:13:17] bd808: :( [16:13:47] bd808: bête cluster isn't scheduled for another few quarters. [16:13:53] Nemo_bis: gotcha [16:13:55] devunt: the only annoying part is it'll need a merge in the puppet repo, but once that is done, it's a no-op for prod, so can be done "whenever [16:13:58] " [16:14:06] Nemo_bis: wikimania2016 and not wikimania2016wiki [16:14:20] ottomata: yeah [16:14:32] ottomata: sorry I was deep in trebuchet [16:14:35] how can I help ? [16:15:02] how can I push to the puppet? [16:15:11] I can't find any instruction documents [16:15:12] Dereckson: ouch; sorry, I had not had time to proofread carefully :( [16:16:27] (03PS1) 10KartikMistry: Added initial Debian package for apertium-tat [debs/contenttranslation/apertium-tat] - 10https://gerrit.wikimedia.org/r/206821 (https://phabricator.wikimedia.org/T95876) [16:16:47] !log boostrap cassandra on xenon [16:16:53] Logged the message, Master [16:17:25] (03PS1) 10Dereckson: Set $wgEnotifMinorEdits to true on wikimania2016.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206822 (https://phabricator.wikimedia.org/T96564) [16:18:06] (03CR) 10Dereckson: "s/wikimania2016/wikimania2016wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206648 (https://phabricator.wikimedia.org/T96564) (owner: 10Dereckson) [16:18:16] RECOVERY - Cassandra database on praseodymium is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [16:18:57] RECOVERY - Cassanda CQL query interface on praseodymium is OK: TCP OK - 0.052 second response time on port 9042 [16:19:36] thcipriani: as a follow-up of change 206648, could we deploy 
https://gerrit.wikimedia.org/r/#/c/206822? [16:19:36] RECOVERY - Cassanda CQL query interface on cerium is OK: TCP OK - 0.011 second response time on port 9042 [16:19:57] RECOVERY - Cassandra database on cerium is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [16:21:16] Dereckson: sure, let's get it out the door quickly here. [16:21:49] (03CR) 10Thcipriani: [C: 032] Set $wgEnotifMinorEdits to true on wikimania2016.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206822 (https://phabricator.wikimedia.org/T96564) (owner: 10Dereckson) [16:21:54] (03Merged) 10jenkins-bot: Set $wgEnotifMinorEdits to true on wikimania2016.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206822 (https://phabricator.wikimedia.org/T96564) (owner: 10Dereckson) [16:22:36] lol too fast for my +1 [16:24:00] akosiaris: nm, i was going to ask if build deps had to be installed on opper [16:24:02] copper [16:24:04] but i see the light! [16:24:05] the answer is no! [16:24:06] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: Extended SWAT [[gerrit:206822]] (duration: 00m 26s) [16:24:07] pbuilder is magic! [16:24:11] Logged the message, Master [16:24:17] ottomata: yup :-) [16:24:27] Nemo_bis: checkbox is now on https://wikimania2016.wikimedia.org/wiki/Special:Preferences#mw-prefsection-personal [16:24:31] thanks thcipriani [16:24:47] akosiaris: thank you so much for this. [16:24:49] Dereckson: yup—yw :) [16:24:57] i'm building a jessie kafkatee package [16:25:06] and the analytics labs project is full, couldn't create an instance [16:25:20] was going to either ask for more capacity, or try to make vagrant jessie...but copper is way better [16:25:22] that was so easy! 
[16:25:27] RECOVERY - puppet last run on ms-be2006 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:25:56] ottomata: happy to hear you like it :-) [16:26:28] the entire point of that work was making our lives easier [16:26:32] How close are we to needing a mediawiki-vagrant jessie option? Is there a plan to switch the MW servers to jessie? [16:27:23] bd808: not an articulated one at this point. It will happen at some point, but for now there is no urgent need [16:27:30] *nod* [16:27:59] akosiaris: how do you get packages from there to carbon? [16:28:06] PROBLEM - puppet last run on sca1002 is CRITICAL Puppet has 1 failures [16:28:11] not only that, but we haven't yet finished the transition to trusty and hhvm [16:28:11] forward key for a sec? copy to local and hten up to carbon? [16:28:41] ottomata: well, the one time I did it it was exactly how you described it [16:28:53] but I am thinking about an nginx on copper and autodirindex [16:29:09] then just wget them? [16:29:12] yes [16:29:14] aye [16:29:24] akosiaris: just do an rsync daemon module then [16:29:28] that will be easier for copying [16:29:35] that would work too, probably even better [16:29:49] ja, allow anyone to read from /var/cache/pbuilder/result [16:30:05] i can do that real quick, shall i throw up a patch? [16:30:21] sure [16:30:30] k... [16:31:15] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1238510 (10kevinator) 5Open>3declined this fix: {T91918} should alleviate the problem. 
[16:31:28] Dereckson: great :) [16:33:06] RECOVERY - puppet last run on sca1002 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:33:07] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:34:20] (03PS1) 10Ottomata: Set up rsync module on copper to allow easy copying of pbuilder built packages to carbon and elsewhere [puppet] - 10https://gerrit.wikimedia.org/r/206828 [16:35:06] PROBLEM - configured eth on ganeti1001 is CRITICAL: tap0 reporting no carrier. [16:35:18] akosiaris: ^^ [16:35:57] PROBLEM - puppet last run on carbon is CRITICAL Puppet last ran 4 hours ago [16:37:37] RECOVERY - puppet last run on carbon is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:39:31] (03CR) 10Alexandros Kosiaris: [C: 04-1] "ferm and monitoring rules missing. Otherwise :-)" [puppet] - 10https://gerrit.wikimedia.org/r/206828 (owner: 10Ottomata) [16:39:37] 6operations, 10Wikimedia-Logstash: Select a standard log shipping solution to use with applications that cannot be configured to send log events directly to Logstash and/or fluorine - https://phabricator.wikimedia.org/T97297#1238534 (10bd808) The only things I have personally used before to do this are the for... [16:44:26] akosiaris: monitoring rules??? [16:44:33] i got your ferm coming, but what monitoring? [16:44:35] for the rsync daemon? 
[16:45:49] (03PS2) 10Ottomata: Set up rsync module on copper to allow easy copying of pbuilder built packages to carbon and elsewhere [puppet] - 10https://gerrit.wikimedia.org/r/206828 [16:48:23] (03PS3) 10Aaron Schulz: Lowered innodb_lock_wait_timeout from defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206442 [16:51:46] (03PS3) 10Alexandros Kosiaris: Set up rsync module on copper [puppet] - 10https://gerrit.wikimedia.org/r/206828 (owner: 10Ottomata) [16:51:50] ottomata: ^ [16:52:24] (03CR) 10jenkins-bot: [V: 04-1] Set up rsync module on copper [puppet] - 10https://gerrit.wikimedia.org/r/206828 (owner: 10Ottomata) [16:53:17] akosiaris: missing closing } [16:53:18] but ok [16:53:25] lol [16:53:58] serves me right for not being careful enough [16:54:05] hehe [16:54:27] (03PS4) 10Alexandros Kosiaris: Set up rsync module on copper [puppet] - 10https://gerrit.wikimedia.org/r/206828 (owner: 10Ottomata) [16:54:42] 6operations, 10ops-eqiad, 10Incident-20141130-eqiad-C4: asw-c4-eqiad hardware fault? - https://phabricator.wikimedia.org/T93730#1238590 (10faidon) @Cmjohnson, can you plug the switch somewhere on a console port? Also, if you're absolutely sure it's zero'ed (the above still has a "hostname" so it doesn't look... [16:55:01] (03CR) 10Aaron Schulz: [C: 032] Lowered innodb_lock_wait_timeout from defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206442 (owner: 10Aaron Schulz) [16:55:08] (03CR) 10jenkins-bot: [V: 04-1] Set up rsync module on copper [puppet] - 10https://gerrit.wikimedia.org/r/206828 (owner: 10Ottomata) [16:55:11] (03CR) 10Aaron Schulz: [C: 032] Set $wgJobSerialCommitThreshold to .1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206451 (owner: 10Aaron Schulz) [16:55:23] haha [16:55:25] akosiaris: : [16:55:26] :) [16:55:31] 'package_builder_rsync': [16:56:09] for the love of... 
as if I am writing my first puppet code [16:56:51] (03PS1) 10Dzahn: doc.wikimedia.org: fix DirectorySlash https->http [puppet] - 10https://gerrit.wikimedia.org/r/206832 (https://phabricator.wikimedia.org/T95164) [16:57:27] (03PS5) 10Alexandros Kosiaris: Set up rsync module on copper [puppet] - 10https://gerrit.wikimedia.org/r/206828 (owner: 10Ottomata) [16:58:23] cool, lgtm akosiaris, i merge? [16:58:26] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/206832/1" [puppet] - 10https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) (owner: 10Dzahn) [16:59:23] ottomata: yup [17:00:51] (03PS3) 10Faidon Liambotis: Add sysfs module, to handle /sys settings [puppet] - 10https://gerrit.wikimedia.org/r/187430 [17:00:53] (03PS1) 10Faidon Liambotis: ganeti: switch to the sysfs module [puppet] - 10https://gerrit.wikimedia.org/r/206834 [17:00:59] (03CR) 10jenkins-bot: [V: 04-1] Lowered innodb_lock_wait_timeout from defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206442 (owner: 10Aaron Schulz) [17:01:03] akosiaris: ^^^ [17:01:09] (03Merged) 10jenkins-bot: Set $wgJobSerialCommitThreshold to .1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206451 (owner: 10Aaron Schulz) [17:01:23] jenkins be slowww [17:01:45] (03PS4) 10Aaron Schulz: Lowered innodb_lock_wait_timeout from defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206442 [17:02:02] (03CR) 10Alexandros Kosiaris: [C: 031] ganeti: switch to the sysfs module [puppet] - 10https://gerrit.wikimedia.org/r/206834 (owner: 10Faidon Liambotis) [17:02:08] (03CR) 10jenkins-bot: [V: 04-1] Add local crontab monitoring [puppet] - 10https://gerrit.wikimedia.org/r/206833 (https://phabricator.wikimedia.org/T96472) (owner: 10Merlijn van Deen) [17:02:43] !log aaron Synchronized wmf-config/jobqueue-codfw.php: Set to .1 (duration: 00m 27s) [17:02:48] Logged the message, Master [17:02:55] (03CR) 10Ottomata: [C: 032] Set up rsync module on copper [puppet] - 
10https://gerrit.wikimedia.org/r/206828 (owner: 10Ottomata) [17:03:02] !log aaron Synchronized wmf-config/jobqueue-eqiad.php: Set to .1 (duration: 00m 11s) [17:03:05] Logged the message, Master [17:03:56] !log aaron Synchronized wmf-config/db-codfw.php: Lowered innodb_lock_wait_timeout from defaults (duration: 00m 22s) [17:03:59] Logged the message, Master [17:04:14] AaronSchulz: hi -- https://gerrit.wikimedia.org/r/#/c/206526/ wasn't cherry-picked as far as I can see; any reason not to? [17:05:26] !log aaron Synchronized wmf-config/db-eqiad.php: Lowered innodb_lock_wait_timeout from defaults (duration: 00m 27s) [17:05:29] Logged the message, Master [17:05:54] (03PS1) 10Ottomata: Use check_command for monitoring::service param [puppet] - 10https://gerrit.wikimedia.org/r/206836 [17:06:23] paravoid, no reason not to [17:06:32] (03CR) 10Ottomata: [C: 032 V: 032] Use check_command for monitoring::service param [puppet] - 10https://gerrit.wikimedia.org/r/206836 (owner: 10Ottomata) [17:07:14] (03PS2) 10Krinkle: doc.wikimedia.org: fix DirectorySlash https->http [puppet] - 10https://gerrit.wikimedia.org/r/206832 (https://phabricator.wikimedia.org/T95164) (owner: 10Dzahn) [17:07:20] (03PS3) 10Krinkle: integration: Apache turn DirectorySlash Off [puppet] - 10https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) (owner: 10Dzahn) [17:08:27] (03CR) 10Krinkle: [C: 031] doc.wikimedia.org: fix DirectorySlash https->http [puppet] - 10https://gerrit.wikimedia.org/r/206832 (https://phabricator.wikimedia.org/T95164) (owner: 10Dzahn) [17:08:42] (03CR) 10Alexandros Kosiaris: [C: 031] Add sysfs module, to handle /sys settings [puppet] - 10https://gerrit.wikimedia.org/r/187430 (owner: 10Faidon Liambotis) [17:09:25] (03CR) 10Alexandros Kosiaris: [C: 032] Add sysfs module, to handle /sys settings [puppet] - 10https://gerrit.wikimedia.org/r/187430 (owner: 10Faidon Liambotis) [17:09:37] (03CR) 10Alexandros Kosiaris: [C: 032] ganeti: switch to the sysfs module 
[puppet] - 10https://gerrit.wikimedia.org/r/206834 (owner: 10Faidon Liambotis) [17:09:51] AaronSchulz: awesome, thanks [17:10:47] AaronSchulz: this is T96360 -- technically it's resolved with this change, although I'm not sure if we should keep a task open about the rationale of saving 2.5MB XML blobs in databases [17:11:24] that's a long standing problem that should be fixed [17:11:30] * AaronSchulz has complained about that before [17:13:26] (03CR) 10Dzahn: "untested, should test on labs" [puppet] - 10https://gerrit.wikimedia.org/r/206832 (https://phabricator.wikimedia.org/T95164) (owner: 10Dzahn) [17:14:09] AaronSchulz: (unrelated) https://github.com/antirez/disque [17:15:11] 6operations, 10Analytics-Cluster, 5Patch-For-Review: Turn off webrequest udp2log instances. - https://phabricator.wikimedia.org/T97294#1238711 (10Ottomata) Today I turned of most udp2log webrequest filters. For now, I have left the Fundraising filters, as well as the 5xx and sampled-1000 filters running. A... [17:15:35] hey, for font packages, would using alien to convert rpm to deb be acceptable when there are no deb's yet and it ..just works [17:15:36] PROBLEM - puppet last run on ganeti1001 is CRITICAL puppet fail [17:15:56] PROBLEM - puppet last run on ganeti2006 is CRITICAL puppet fail [17:16:20] (T97181) [17:16:35] mutante: gah, no [17:16:38] no alien please :) [17:16:51] heh, worth a try, it worked though:) [17:17:17] PROBLEM - puppet last run on ganeti2003 is CRITICAL puppet fail [17:17:36] PROBLEM - puppet last run on ganeti2001 is CRITICAL puppet fail [17:17:46] PROBLEM - puppet last run on ganeti2002 is CRITICAL puppet fail [17:17:47] PROBLEM - puppet last run on ganeti2004 is CRITICAL puppet fail [17:18:07] PROBLEM - puppet last run on ganeti1002 is CRITICAL puppet fail [17:18:26] PROBLEM - puppet last run on ganeti1003 is CRITICAL puppet fail [17:19:06] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "If there is no compelling reason to use roles (like some specific config being shared 
between multiple servers) we should stick to per-hos" [puppet] - 10https://gerrit.wikimedia.org/r/206026 (owner: 10Dzahn) [17:19:30] (03PS1) 10Alexandros Kosiaris: Fix sysfsutils service metaparameter [puppet] - 10https://gerrit.wikimedia.org/r/206840 [17:20:17] (03CR) 10Alexandros Kosiaris: [C: 032] Fix sysfsutils service metaparameter [puppet] - 10https://gerrit.wikimedia.org/r/206840 (owner: 10Alexandros Kosiaris) [17:20:46] !log aaron Synchronized php-1.26wmf3/includes/media/DjVu.php: b980b0a9457b2f98a502cfe36edfc75300c7952f (duration: 00m 27s) [17:20:50] Logged the message, Master [17:22:01] 6operations, 10Wikimedia-SVG-rendering: Install PT (paratype) font on image scalars - https://phabricator.wikimedia.org/T97181#1238735 (10Dzahn) Faidon said 'no alien please'. So yea, needs "real" .deb packages. [17:22:17] !log aaron Synchronized php-1.26wmf2/includes/media/DjVu.php: 40d702b8d2d023d6f701e4aeb082b62b7adf2f0f (duration: 00m 19s) [17:22:20] Logged the message, Master [17:22:52] PROBLEM - puppet last run on ganeti2005 is CRITICAL puppet fail [17:23:03] RECOVERY - puppet last run on ganeti1001 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [17:23:06] (03Abandoned) 10Dzahn: allow role-based hiera lookup on tin [puppet] - 10https://gerrit.wikimedia.org/r/206026 (owner: 10Dzahn) [17:23:45] <_joe_> mutante: I think we shouldn't abuse the role keyword [17:24:20] <_joe_> mutante: for ganglia_new migration, you need me to do some hiera work. I realized it's the only way for it not being a pain [17:27:02] _joe_: ok, yes, agreed and thanks. i'll use the existing hosts/tin.yaml for that one [17:28:05] the docs say the role keyword itself is already "abusing" puppet internals :) [17:28:18] abuse is good [17:30:19] so besides the host/role, when i add the new ganglia class it works so far and starts a gmond etc, just the old one will also keep running, and you said to not stop it (yet?)? 
[17:32:17] (03PS1) 10Dzahn: tin -> ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/206845 [17:32:53] mutante or matanya, any concern with changes like this? https://gerrit.wikimedia.org/r/#/c/206721/1/modules/base/templates/puppet.conf.d/10-main.conf.erb I believe them to be noops but I want a second opinion before I start merging a bunch of ‘em. [17:33:28] andrewbogott: give me 5 min' and i'll review, please [17:33:36] thanks! [17:34:33] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [17:34:48] (03CR) 10Bmansurov: [C: 031] Enable Browse experiment on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206375 (https://phabricator.wikimedia.org/T94739) (owner: 10Phuedx) [17:35:03] andrewbogott: how about running in compiler? [17:35:15] 6operations, 6WMF-Legal, 10Wikimedia-General-or-Unknown, 7Documentation: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#1238802 (10chasemp) Does anyone oppose [[ http://www.apache.org/licenses/LICENSE-2.0.html | Apache2 ]] as a default? [17:35:41] mutante: oh, I was thinking that it wouldn’t notice erb but that’s wrong of course. [17:35:43] I will try. [17:36:22] PROBLEM - puppet last run on ganeti1001 is CRITICAL puppet fail [17:38:54] (03CR) 10John F. Lewis: [C: 031] Use @certname instead of certname in .erb [puppet] - 10https://gerrit.wikimedia.org/r/206721 (owner: 10Andrew Bogott) [17:39:23] (03PS1) 10Dereckson: Removed autoreview group on fr.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206848 (https://phabricator.wikimedia.org/T90979) [17:39:34] andrewbogott: looking at https://gerrit.wikimedia.org/r/#/c/206723/1/modules/mysql/templates/apparmor.template.usr.sbin.mysqld.erb I don't see where config_file is coming from ? 
[17:40:33] matanya: You have hit at the heart of my question — is adding random @’s like that without looking at the invoking code ever dangerous? [17:40:51] it might be andrewbogott [17:40:51] notably, the puppetmaster complained about the lacking @ but didn’t complain about the var being undefined. [17:40:55] :( [17:40:59] What would change? [17:41:11] the scope [17:41:24] you are looking locally [17:41:37] _joe_: is best at explaining this [17:41:44] (03CR) 10John F. Lewis: [C: 031] "Assuming all holes exist." [puppet] - 10https://gerrit.wikimedia.org/r/205903 (owner: 10Dzahn) [17:42:04] you’re right that config_file seems undefined. [17:42:49] (03CR) 10John F. Lewis: [C: 031] "If holes exists." [puppet] - 10https://gerrit.wikimedia.org/r/205904 (owner: 10Dzahn) [17:43:00] matanya: so without the @ it could be pulling that var from elsewhere, not just local scope? [17:43:09] yes [17:43:14] ok [17:43:29] So in order to fix those deprecation notices I have to refactor the entire codebase woo [17:43:54] it is best to prefix @ local vars only [17:44:11] and once your var are all local, you would be happy [17:44:14] andrewbogott: @ is meant to be used for variable that are either puppet local scope or global scope [17:44:20] namely facts [17:44:38] so config_hash is local scope in that regard in there [17:44:49] and needs @ [17:44:55] config_file... where does it come from ? 
[17:45:02] (03CR) 10Dereckson: "Follow-up: I35ea1d800d79148298a67d08eca5eae16240f94c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206647 (https://phabricator.wikimedia.org/T90979) (owner: 10Dereckson) [17:45:04] thanks akosiaris [17:45:19] (03PS1) 10ArielGlenn: first draft python wrapper for html dumps [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/206849 [17:45:56] I guess it comes from modules/mysql/manifests/config.pp [17:46:14] * modules/mysql/manifests/params.pp [17:47:10] yeah, it is params indeed [17:47:24] which is inherited by the server class [17:47:39] (03CR) 10Dzahn: "hashar, thoughts?" [puppet] - 10https://gerrit.wikimedia.org/r/170130 (owner: 10Cscott) [17:47:40] so, here's something that does need to be refactored [17:48:12] the params pattern and the inheritance needs to go [17:48:45] (03CR) 10ArielGlenn: "don't commit this, it's untested except for bit and pieces but shouldn't take log to work out any kinks. Gabriel I added you because this " [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/206849 (owner: 10ArielGlenn) [17:48:49] 6operations, 10Traffic, 7HTTPS: review/rebase/merge the final sslcert patch... - https://phabricator.wikimedia.org/T97316#1238905 (10BBlack) [17:49:52] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
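A minimal Ruby sketch of the scoping point raised in the exchange above. The `ToyWrapper` class and its `outer_vars` argument are hypothetical and are not Puppet's actual template code. It only shows plain Ruby/ERB behaviour: a bare name in a template can be resolved through `method_missing` from somewhere other than local scope, while an `@name` reference to an unset instance variable quietly evaluates to nil. That may be why adding an `@` silences the deprecation notice without flagging an undefined variable.

```ruby
require 'erb'

# Toy stand-in for a template wrapper (an assumption for illustration,
# not the real Puppet implementation).
class ToyWrapper
  def initialize(local_vars, outer_vars)
    # Variables from "local scope" become instance variables: @name.
    local_vars.each { |name, value| instance_variable_set("@#{name}", value) }
    # Variables from "elsewhere" are only reachable via bare names.
    @__outer = outer_vars
  end

  # A bare name that is neither a local nor a method lands here.
  def method_missing(name, *args)
    return @__outer[name] if args.empty? && @__outer.key?(name)
    super
  end

  def respond_to_missing?(name, include_private = false)
    @__outer.key?(name) || super
  end

  # Evaluate the ERB source with this object as self.
  def render(source)
    ERB.new(source).result(binding)
  end
end

w = ToyWrapper.new({ config_hash: 'from-local-scope' },
                   { config_file: 'from-outer-scope' })

puts w.render('<%= @config_hash %>').inspect  # local variable, found via @
puts w.render('<%= config_file %>').inspect   # bare name, found via method_missing
puts w.render('<%= @config_file %>').inspect  # unset instance variable: nil, renders as ""
```

In this toy model, prefixing `config_file` with `@` turns a working lookup into an empty string with no error, so the invoking code has to be checked before each such change.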
[17:50:11] It clearly needs refactoring but I will probably not be doing that this afternoon :) [17:50:41] (03CR) 10Matanya: [C: 04-1] "config_file comes from modules/mysql/manifests/params.pp which is inherited by the server class," [puppet] - 10https://gerrit.wikimedia.org/r/206723 (owner: 10Andrew Bogott) [17:51:24] RECOVERY - puppet last run on ganeti1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:52:07] matanya, akosiaris, thanks for the info; I will proceed with a more delicate touch [17:52:31] :) [17:52:42] PROBLEM - puppet last run on cp4014 is CRITICAL Puppet last ran 4 hours ago [17:52:42] PROBLEM - puppet last run on cp4004 is CRITICAL Puppet last ran 4 hours ago [17:52:43] PROBLEM - puppet last run on cp1061 is CRITICAL Puppet last ran 4 hours ago [17:52:52] PROBLEM - puppet last run on cp3049 is CRITICAL Puppet last ran 4 hours ago [17:52:52] PROBLEM - puppet last run on cp3009 is CRITICAL Puppet last ran 4 hours ago [17:52:53] PROBLEM - puppet last run on cp1063 is CRITICAL Puppet last ran 4 hours ago [17:52:53] PROBLEM - puppet last run on cp3042 is CRITICAL Puppet last ran 4 hours ago [17:53:03] PROBLEM - puppet last run on cp1046 is CRITICAL Puppet last ran 4 hours ago [17:53:03] PROBLEM - puppet last run on cp1052 is CRITICAL Puppet last ran 4 hours ago [17:53:04] PROBLEM - puppet last run on cp1060 is CRITICAL Puppet last ran 4 hours ago [17:53:13] PROBLEM - puppet last run on cp3031 is CRITICAL Puppet last ran 4 hours ago [17:53:13] PROBLEM - puppet last run on cp3006 is CRITICAL Puppet last ran 4 hours ago [17:53:13] PROBLEM - puppet last run on cp3003 is CRITICAL Puppet last ran 4 hours ago [17:53:22] RECOVERY - puppet last run on ganeti2003 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:53:22] PROBLEM - puppet last run on cp1048 is CRITICAL Puppet last ran 4 hours ago [17:53:23] PROBLEM - puppet last run on cp4019 is CRITICAL Puppet last ran 4 hours ago [17:53:23] PROBLEM - 
puppet last run on cp3034 is CRITICAL Puppet last ran 4 hours ago [17:53:23] PROBLEM - puppet last run on cp3008 is CRITICAL Puppet last ran 4 hours ago [17:53:23] PROBLEM - puppet last run on cp3014 is CRITICAL Puppet last ran 4 hours ago [17:53:53] !log temp stopped icinga-wm [17:53:56] Logged the message, Master [17:55:19] eh, so puppet run changes the ..puppet run cron? [17:55:59] what's the 'standard' desktop environment on jessie? [17:56:22] They have a multitude of options. [17:57:43] RECOVERY - DPKG on cp1066 is OK: All packages OK [17:58:03] RECOVERY - DPKG on cp1064 is OK: All packages OK [18:00:23] RECOVERY - DPKG on cp1065 is OK: All packages OK [18:00:56] cajoel: GNOME [18:01:22] RECOVERY - puppet last run on ganeti2005 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [18:01:28] mutante: oddly, there's also a gnome iso [18:01:39] http://cdimage.debian.org/debian-cd/current-live/amd64/iso-hybrid/ [18:01:44] is 'standard' CLI only? [18:01:53] 'server' image? [18:02:11] 416MB has me thinking maybe it is. [18:02:47] (03PS1) 10Krinkle: contint: Move tmpfs and slave-scripts from slave::labs to slave::labs::common [puppet] - 10https://gerrit.wikimedia.org/r/206853 (https://phabricator.wikimedia.org/T97257) [18:03:02] (03PS2) 10Krinkle: contint: Move tmpfs and slave-scripts from slave::labs to slave::labs::common [puppet] - 10https://gerrit.wikimedia.org/r/206853 (https://phabricator.wikimedia.org/T97257) [18:03:42] PROBLEM - puppet last run on cp1055 is CRITICAL Puppet last ran 4 hours ago [18:03:43] YuviPanda: (I think it was you) the de wiki dumps should be showing up in labs any day now [18:04:00] if they aren't there already I expect them to be there tommorrow [18:04:13] PROBLEM - puppet last run on cp3040 is CRITICAL Puppet last ran 4 hours ago [18:04:24] apergos: it was me! [18:04:25] cool [18:04:32] fatalmonitor is full of "Search backend error during full_text search " warnings. 
Are those something I should ignore or does this indicate a problem with elasticsearch? [18:05:00] cajoel: sorry, meeting now. not sure because i never really download full iso's, the tiny netinstall image is enough https://www.debian.org/distrib/netinst , then the rest it can fetch on demand [18:05:07] <^d> twentyafterfour: Already tracked [18:05:14] RECOVERY - puppet last run on cp1055 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:21] <^d> twentyafterfour: Annoying, but not end of the world. [18:05:43] RECOVERY - puppet last run on cp1056 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [18:05:53] RECOVERY - puppet last run on cp3040 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:53] RECOVERY - puppet last run on cp3037 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [18:06:00] <^d> twentyafterfour: https://phabricator.wikimedia.org/T94814 is the tracking task. [18:06:10] <^d> (there's actually 3 bugs here showing themselves) [18:06:12] RECOVERY - puppet last run on cp1061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:06:23] RECOVERY - puppet last run on cp3042 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [18:06:36] cajoel: i'd get netinstall image and write it on a flash drive, then boot from that, let it do DHCP, and select what you want during installer .. dont worry about iso images [18:06:53] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:07:13] RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [18:07:13] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [18:07:22] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 28 hours old. 
[18:07:42] RECOVERY - puppet last run on cp1071 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [18:07:42] RECOVERY - puppet last run on cp1058 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [18:07:43] RECOVERY - puppet last run on cp3041 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [18:07:52] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [18:07:52] RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:08:24] RECOVERY - puppet last run on cp3006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:08:33] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:09:03] RECOVERY - puppet last run on cp4001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:09:12] RECOVERY - puppet last run on cp1050 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [18:09:16] <^d> twentyafterfour: The flood you see is a combo of the known bug and a guy who's got a poorly-written bot. It's being handled this week. 
[18:09:23] RECOVERY - puppet last run on cp4005 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [18:09:23] RECOVERY - puppet last run on cp3004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:09:25] (03CR) 10Krinkle: [C: 04-1] "Causes Apr 27 18:06:44 deployment-bastion puppet-agent[30741]: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplica" [puppet] - 10https://gerrit.wikimedia.org/r/206853 (https://phabricator.wikimedia.org/T97257) (owner: 10Krinkle) [18:09:42] RECOVERY - puppet last run on cp3009 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:09:53] RECOVERY - puppet last run on cp1046 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:10:03] RECOVERY - puppet last run on cp3003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:10:13] RECOVERY - puppet last run on cp4019 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:10:33] RECOVERY - puppet last run on cp1062 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [18:10:43] RECOVERY - puppet last run on cp3035 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [18:10:43] RECOVERY - DPKG on cp1063 is OK: All packages OK [18:10:53] RECOVERY - puppet last run on cp3005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:11:03] RECOVERY - puppet last run on cp4018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:11:13] RECOVERY - puppet last run on cp1063 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:11:43] RECOVERY - puppet last run on cp3031 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [18:11:50] Krinkle: pet peeve: could we rename 'contint' to 'integration' or 'ci' at some point? [18:12:08] where? [18:12:22] greg-g: hey, saw my ping earlier on? 
[18:12:40] * greg-g looks [18:12:54] ahh, wikibase update? [18:12:59] Exactly [18:13:00] (03PS1) 10Thcipriani: Make scap localization cache build $TMPDIR aware [tools/scap] - 10https://gerrit.wikimedia.org/r/206856 (https://phabricator.wikimedia.org/T97257) [18:13:02] RECOVERY - puppet last run on cp3049 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [18:13:03] not sure it will happen today [18:13:04] 6operations, 10incident-20150422-LabsOutage: packages not upgraded post-install - https://phabricator.wikimedia.org/T94177#1239081 (10fgiunchedi) I'm attaching this to the labs outage since it might have prevented (hidden?) the outage if packages were updated post-install [18:13:13] there have been code style concairns on gerrit [18:13:13] RECOVERY - puppet last run on cp1060 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:13:19] but it's unbreak now [18:13:28] trying to get the patch into shape literally right now [18:13:33] RECOVERY - puppet last run on cp3034 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:13:52] hoo: ok, doit [18:13:52] RECOVERY - puppet last run on cp3019 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [18:14:02] RECOVERY - puppet last run on ganeti2001 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [18:14:03] RECOVERY - puppet last run on ganeti2006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:14:03] RECOVERY - puppet last run on ganeti2002 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [18:14:35] ori: In puppet? [18:14:42] yeah [18:14:46] sure [18:14:52] RECOVERY - puppet last run on cp1052 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:14:53] RECOVERY - puppet last run on cp3039 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:15:01] contint is weird (continuous integer? 
continental breakfast?) [18:15:02] RECOVERY - puppet last run on ganeti1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:15:43] RECOVERY - puppet last run on cp4020 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:16:49] (03PS1) 10Dereckson: Enabled ShortUrl on kn.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206857 (https://phabricator.wikimedia.org/T97218) [18:17:12] <^d> Preferably not a continental breakfast. [18:17:13] (03PS3) 10Yuvipanda: Tools: Fix and clean up generation of /etc/ssh/ssh_known_keys [puppet] - 10https://gerrit.wikimedia.org/r/196125 (https://phabricator.wikimedia.org/T92379) (owner: 10Tim Landscheidt) [18:17:49] (03CR) 10Yuvipanda: [C: 032 V: 032] "Thank you everyone :)" [puppet] - 10https://gerrit.wikimedia.org/r/196125 (https://phabricator.wikimedia.org/T92379) (owner: 10Tim Landscheidt) [18:21:40] 6operations, 6Labs: Add catchall tests for toollabs to catchpoint - https://phabricator.wikimedia.org/T97321#1239133 (10yuvipanda) 3NEW [18:22:02] 6operations, 6Labs, 10Tool-Labs, 7Monitoring: Add catchall tests for toollabs to catchpoint - https://phabricator.wikimedia.org/T97321#1239144 (10yuvipanda) [18:22:18] 6operations, 5wikis-in-codfw: Document what is left for having a full cluster installation in codfw - https://phabricator.wikimedia.org/T97322#1239146 (10Joe) 3NEW a:3Joe [18:22:30] 6operations, 5wikis-in-codfw: Document what is left for having a full cluster installation in codfw - https://phabricator.wikimedia.org/T97322#1239154 (10Joe) p:5Triage>3High [18:23:21] <_joe_> paravoid: ^^ [18:23:29] (03PS3) 10Krinkle: contint: Move jenkins/tmpfs from slave::labs to slave::labs::common [puppet] - 10https://gerrit.wikimedia.org/r/206853 (https://phabricator.wikimedia.org/T97257) [18:29:04] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old. 
[18:30:14] (03CR) 10Ori.livneh: [C: 031] "LGTM; let me know if you want a merge." [puppet] - 10https://gerrit.wikimedia.org/r/206853 (https://phabricator.wikimedia.org/T97257) (owner: 10Krinkle) [18:31:11] (03CR) 10Krinkle: [C: 031] "Cherry-picked to integration-puppetmaster and deployment-salt in labs. Verified to work as expected without errors." [puppet] - 10https://gerrit.wikimedia.org/r/206853 (https://phabricator.wikimedia.org/T97257) (owner: 10Krinkle) [18:31:12] ori: Ready :) [18:31:25] (03PS4) 10Ori.livneh: contint: Move jenkins/tmpfs from slave::labs to slave::labs::common [puppet] - 10https://gerrit.wikimedia.org/r/206853 (https://phabricator.wikimedia.org/T97257) (owner: 10Krinkle) [18:31:32] (03CR) 10Ori.livneh: [C: 032 V: 032] contint: Move jenkins/tmpfs from slave::labs to slave::labs::common [puppet] - 10https://gerrit.wikimedia.org/r/206853 (https://phabricator.wikimedia.org/T97257) (owner: 10Krinkle) [18:31:59] YuviPanda: I merged: Tools: Fix and clean up generation of /etc/ssh/ssh_known_keys (2f8fe4c4d0) [18:32:09] ori: bah, got distracted by meeting. sorry [18:37:44] (03CR) 10Yurik: [C: 031] Prevent new wikis to use Graph: namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206776 (owner: 10Dereckson) [18:39:45] (03PS3) 10Ori.livneh: Prevent new wikis from using Graph: namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206776 (owner: 10Dereckson) [18:44:01] (03CR) 10BryanDavis: Make scap localization cache build $TMPDIR aware (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/206856 (https://phabricator.wikimedia.org/T97257) (owner: 10Thcipriani) [18:44:21] how can I add merge my extension to puppet repo? [18:44:56] devunt: what is your extension? is it a MediaWiki extension? [18:45:03] yes it is. [18:45:20] I'm planning to deploy it to beta cluster [18:45:36] have you read https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Case_1d:_new_extension already? 
[18:45:42] does your extension have a repository in gerrit? [18:46:10] yes. [18:46:25] what is the extension name? [18:46:30] Josa [18:46:34] * ori looks [18:47:08] devunt: it would have a separate repository 'mediawiki/extensions/foo' [18:47:34] yeah, it has one [18:47:34] I already have submitted these changesets: [18:47:35] https://gerrit.wikimedia.org/r/#/c/203642/ [18:47:37] https://gerrit.wikimedia.org/r/#/c/203627/ [18:47:54] (03PS9) 10Ori.livneh: Add Josa extension to ko.wikipedia.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203627 (https://phabricator.wikimedia.org/T15712) (owner: 10devunt) [18:48:05] (03CR) 10Ori.livneh: [C: 032] Add Josa extension to ko.wikipedia.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203627 (https://phabricator.wikimedia.org/T15712) (owner: 10devunt) [18:48:10] (03Merged) 10jenkins-bot: Add Josa extension to ko.wikipedia.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203627 (https://phabricator.wikimedia.org/T15712) (owner: 10devunt) [18:48:40] devunt: now you just wait up to 5 minutes [18:49:39] (03PS1) 10Aaron Schulz: [WIP] Set $wgActivityUpdatesUseJobQueue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206862 [18:50:06] Okay. [18:50:40] (03PS1) 10Krinkle: Revert "contint: Move jenkins/tmpfs from slave::labs to slave::labs::common" [puppet] - 10https://gerrit.wikimedia.org/r/206863 [18:51:00] (03CR) 10Mjbmr: [C: 031] Enabled ShortUrl on kn.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206857 (https://phabricator.wikimedia.org/T97218) (owner: 10Dereckson) [18:51:40] (03CR) 10Thcipriani: Make scap localization cache build $TMPDIR aware (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/206856 (https://phabricator.wikimedia.org/T97257) (owner: 10Thcipriani) [18:51:58] (03CR) 10Krinkle: "We can create a larger tmpfs mount inside deployment if it helps performance, but not inside the generic ci slave role." 
[puppet] - 10https://gerrit.wikimedia.org/r/206863 (owner: 10Krinkle) [18:52:32] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 28 hours old. [18:54:01] (03CR) 10Krinkle: [C: 031] "Cherry-picked to integration-puppetmaster and deployment-salt in labs. Fixes the broken job." [puppet] - 10https://gerrit.wikimedia.org/r/206863 (owner: 10Krinkle) [18:54:12] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old. [19:01:45] (03PS2) 10BBlack: varnish: implement 'do_gzip' cluster option for mobile/text frontend, too [puppet] - 10https://gerrit.wikimedia.org/r/206348 (owner: 10Ori.livneh) [19:01:55] (03CR) 10BBlack: [C: 032 V: 032] varnish: implement 'do_gzip' cluster option for mobile/text frontend, too [puppet] - 10https://gerrit.wikimedia.org/r/206348 (owner: 10Ori.livneh) [19:03:01] (03CR) 10Alex Monk: Prevent new wikis from using Graph: namespace (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206776 (owner: 10Dereckson) [19:03:41] (03PS1) 10Anomie: Remove sampling of api.log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206865 (https://phabricator.wikimedia.org/T88393) [19:04:14] ori, it doesn't seem work [19:04:34] there's no Josa on http://ko.wikipedia.beta.wmflabs.org/w/index.php?title=Special:Version&uselang=en [19:05:13] (03CR) 10BBlack: [C: 04-1] "Thinking this through a little better: probably if I deploy this immediately for all traffic (like the current patch), it will effectively" [puppet] - 10https://gerrit.wikimedia.org/r/206387 (owner: 10BBlack) [19:09:41] (03CR) 10Dzahn: "what Hoo said, you can just modify the old cron" [puppet] - 10https://gerrit.wikimedia.org/r/205644 (owner: 10Aude) [19:10:27] (03PS1) 10Ottomata: Add parameters for minimum and maximum allocation vcores [puppet/cdh] - 10https://gerrit.wikimedia.org/r/206866 [19:10:53] devunt: looking [19:11:15] (03CR) 10Ottomata: [C: 032] Add parameters for minimum and 
maximum allocation vcores [puppet/cdh] - 10https://gerrit.wikimedia.org/r/206866 (owner: 10Ottomata) [19:12:25] (03PS1) 10Ottomata: Setting minimum allocation mb and vcores to 0 to allow Impala to submit small reservertion requests [puppet] - 10https://gerrit.wikimedia.org/r/206877 (https://phabricator.wikimedia.org/T96329) [19:12:38] (03PS2) 10Ottomata: Setting minimum allocation mb and vcores to 0 to allow Impala to submit small reservertion requests [puppet] - 10https://gerrit.wikimedia.org/r/206877 (https://phabricator.wikimedia.org/T96329) [19:15:35] (03CR) 10Ottomata: [C: 032] Setting minimum allocation mb and vcores to 0 to allow Impala to submit small reservertion requests [puppet] - 10https://gerrit.wikimedia.org/r/206877 (https://phabricator.wikimedia.org/T96329) (owner: 10Ottomata) [19:16:25] (03PS1) 10BBlack: de-dupe /static hashing for text/mobile [puppet] - 10https://gerrit.wikimedia.org/r/206878 (https://phabricator.wikimedia.org/T95448) [19:17:09] (03CR) 10Filippo Giunchedi: Remove sampling of api.log (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206865 (https://phabricator.wikimedia.org/T88393) (owner: 10Anomie) [19:17:17] 6operations, 10Traffic, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1239411 (10BBlack) https://gerrit.wikimedia.org/r/206878 <- vcl_hash support for this (optimization) [19:23:11] twentyafterfour, the irc bouncer seems to format gerrit URLs like "https://gerrit.wikimedia.org/r/206865%3C/span%3E" when clicked...some sort of misparsing going on [19:24:04] also switching channels is very slow and is blocking JS :( [19:25:28] AaronSchulz: I’m giving out WMF branced IRCCloud accounts if you want :) [19:29:38] (03CR) 10Ori.livneh: "Two small suggestions:" [puppet] - 10https://gerrit.wikimedia.org/r/206878 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [19:29:41] ^ bblack [19:31:03] PROBLEM - Unmerged changes on repository puppet on strontium is 
CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [19:32:03] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [19:35:49] ottomata: ^ unmerged is you [19:37:09] oo? [19:37:10] k [19:37:27] whoops, thanks [19:37:37] AaronSchulz: yeah I'm aware of both of those bugs... the slow channel switching is really annoying (it's doing some unnecessary recursive sorting of messages...) [19:37:45] 6operations, 10ops-eqiad, 10fundraising-tech-ops: barium has a failed HDD - https://phabricator.wikimedia.org/T93899#1239430 (10Jgreen) 5Resolved>3Open Reopening because RAID is degraded (still? again?) -- appears that one of the disks is still offline, but I'm not sure whether its the one that physicall... [19:37:58] twentyafterfour: what are you using? custom bouncer? [19:38:11] devunt: still looking [19:39:34] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 29 hours old. [19:40:28] ori: ircanywhere [19:40:35] (03PS2) 10BBlack: de-dupe /static hashing for text/mobile [puppet] - 10https://gerrit.wikimedia.org/r/206878 (https://phabricator.wikimedia.org/T95448) [19:40:44] the bouncer isn't the buggy part, it's the web interface that's got a few bugs [19:41:14] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old. [19:44:45] (03CR) 10Anomie: Remove sampling of api.log (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206865 (https://phabricator.wikimedia.org/T88393) (owner: 10Anomie) [19:47:00] (03CR) 10Ori.livneh: [C: 031] de-dupe /static hashing for text/mobile [puppet] - 10https://gerrit.wikimedia.org/r/206878 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [19:47:16] !log apt-get upgrade on iron (incl. 
apt itself, gnupg, ssl) [19:47:24] Logged the message, Master [19:48:15] devunt: beta<-> CI bridge is currently down because of maintenance [19:48:23] it should be deployed as soon as that is fixed [19:48:29] no ETA but hopefully not too long [19:48:51] wee, and libc6 and dnsutils :p [19:48:57] devunt: https://phabricator.wikimedia.org/T97257 [19:49:25] Yeah, beta deployments are currently down [19:50:04] I'll try to get one build through [19:58:25] 6operations, 7Documentation: Create documentation on the requesting/allocation of virtual machines in the misc cluster - https://phabricator.wikimedia.org/T97072#1239443 (10RobH) Update from IRC: If a user needs either a VM or bare metal, and they aren't certain of which, they can file a task with #operations... [19:59:31] 6operations, 10Wikimedia-DNS: Set up new URL - https://phabricator.wikimedia.org/T97329#1239453 (10Krenair) Please associate projects when creating tickets. [19:59:47] 6operations, 7Documentation: create #vm-requests (a production vm cluster request project similar to #hardware-requests) - https://phabricator.wikimedia.org/T97330#1239455 (10RobH) 3NEW a:3RobH [20:02:07] heya, robh, is jessie the default OS if I do an install now? [20:02:12] i'm going to reinstall oxygen and I want it to pick up jessie [20:02:18] not unless you specify it [20:02:23] ah, ok [20:02:26] 6operations, 6Project-Creators, 7Documentation: create #vm-requests (a production vm cluster request project similar to #hardware-requests) - https://phabricator.wikimedia.org/T97330#1239465 (10Krenair) These tickets need to be associated with #Project-Creators [20:02:30] so it is not the default then? [20:02:31] you can see the default in the dhcp base config [20:02:42] it isnt the defaut yet [20:02:48] ok cool [20:02:49] default even, i cannot spell [20:02:51] thanks [20:03:00] though, it begs the question, why isnt it yet? 
;D [20:03:17] I'm not sure we hit the 50% or more of cluster is jessie, if we have, then it makes sense to change it. [20:03:49] s/cluster/entire server farm across all clusters and sites [20:04:01] ja [20:04:20] that's funny, i am getting those stats right now [20:04:23] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [20:04:26] hold on [20:04:34] (03CR) 10Aklapper: [C: 031] Phab monthly stats email: Show how many projects saw workboard moves [puppet] - 10https://gerrit.wikimedia.org/r/206518 (owner: 10Aklapper) [20:04:45] so puppet/modules/install-server/files/dhcpd/dhcpd.conf has the lines where the config states the default install [20:04:45] oo, robh, another question [20:04:49] oxygen currently has a public IP [20:04:52] it shouldn't anymore. [20:04:53] which is still trutsy as of now [20:04:58] ottomata: robh: i'll tell you if it's over 50% or not in a few [20:05:09] ottomata: ok, you are reinstalling and not keeping data? [20:05:14] yes [20:05:15] wiping is good [20:05:17] cuz if so, now is the good time to redo the ip [20:05:22] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [20:05:40] so, if you are reinstalling, I'd suggest you use the tempate for a fresh install anyhow [20:05:56] 6operations, 10Traffic: increase misc-web-lb cp pool from 2 to 3 systems? - https://phabricator.wikimedia.org/T86718#1239468 (10BBlack) [20:05:59] https://wikitech.wikimedia.org/wiki/Phabricator#Hardware.2FServer_Setup_.2F_Deployment_Stage_Workflow [20:06:16] since we'll have to redo the production dns and then redo the production switch port vlan assignment [20:06:30] 6operations, 10Traffic: increase misc-web-lb cp pool from 2 to 3 systems? - https://phabricator.wikimedia.org/T86718#974794 (10BBlack) Blocking this on ipsec, so we don't reduce the security of any critical HTTPS sessions behind misc-web by moving them out to remote TLS termination and then backhauling to eqia... 
[20:06:31] ottomata: also, if you need a wipe, an actual disk wipe, then we have to set an onsite task for it [20:06:37] (03PS2) 10Aklapper: Phab monthly stats email: Clarify what day values for priority mean [puppet] - 10https://gerrit.wikimedia.org/r/206515 [20:06:38] versus just reinstall over the top. [20:06:43] (03CR) 10jenkins-bot: [V: 04-1] Phab monthly stats email: Clarify what day values for priority mean [puppet] - 10https://gerrit.wikimedia.org/r/206515 (owner: 10Aklapper) [20:06:59] So you want to reinstall to jessie and move to an internal ip/vlan, what else? =] [20:07:18] would be nice to have a PXE boot option "wipe_self" [20:07:26] but needs triple opt-in :p [20:07:29] mutante: too dangerous, i thought about it [20:07:38] the only way its ok is to set it to ONLY serve that image on a special vlan [20:07:43] and put systems into it for testing and wipe [20:07:44] type: "YesIAmSure" [20:07:46] robh, no need for wipe wipe, jsut reinstall on top is fine [20:07:48] (03CR) 10Aklapper: [C: 031] Phab monthly stats email: Clarify what day values for priority mean [puppet] - 10https://gerrit.wikimedia.org/r/206515 (owner: 10Aklapper) [20:07:51] its on a long term project for me but i havent gotten to it [20:07:57] i might have a look at what disks it has and change part layout, but maybe not [20:07:59] ideally we have a test/burnin vlan where we do all testing and wipes [20:08:05] robh: *nod* [20:08:18] ok, robh, will make a ticket [20:08:20] i was messing with dban but now its non ideal. [20:08:40] i think the ideal at this point is a debian live image boot into a wipe via command line [20:08:44] (automated) [20:08:56] but, thats just from random musing, not actual dug in work yet. 
[20:10:53] 6operations: Reinstall oxygen with Jessie - https://phabricator.wikimedia.org/T97331#1239478 (10Ottomata) 3NEW [20:11:02] robh: https://phabricator.wikimedia.org/T97331 [20:11:46] so the mgmt stuff is done, existing server [20:11:52] but its a good template to make sure we dont miss shit [20:12:04] ottomata: I can do - network switch setup (port description & vlan) - needs change to private IP. for you [20:12:14] 10Ops-Access-Requests, 6operations: Add ebernhardson to 'stats' group for query access to eventlogging data on stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T97332#1239487 (10EBernhardson) 3NEW [20:12:18] if you do the dns changes for production (i can review) [20:12:27] but, i wanna make sure you arent using it right now [20:12:32] robh, yeah if you do that i can do the rest [20:12:33] if you arent, i'll change now [20:12:36] it is ready to go [20:12:37] do it now [20:12:43] cool, doing [20:12:46] woohoo [20:12:51] prompt turnaround! :) [20:12:56] class role::suicide ( exec { "dd if=/dev/zero of=/dev/sda bs=4096" *jk* [20:13:08] so row a [20:13:15] private1-a-eqiad is your subnet [20:13:18] robh, you need to tell me which...aye ya, which vlan, ja? ok cool [20:13:20] just pick the next IP? [20:13:29] yep, you can set me to review if ya like [20:13:42] next available, or if htere is a spare one somewhere in the range that is now free, either works [20:14:00] k looking [20:14:32] 6operations, 6Project-Creators, 7Documentation: create #vm-requests (a production vm cluster request project similar to #hardware-requests) - https://phabricator.wikimedia.org/T97330#1239497 (10Aklapper) No objections, but [[ https://www.mediawiki.org/wiki/Phabricator/Creating_and_renaming_projects#Guideline... 
[20:14:44] vlan change done [20:15:02] 6operations: Reinstall oxygen with Jessie - https://phabricator.wikimedia.org/T97331#1239498 (10RobH) [20:15:20] 6operations: Reinstall oxygen with Jessie - https://phabricator.wikimedia.org/T97331#1239478 (10RobH) [20:15:46] 6operations, 10Traffic: Align prod + beta-cluster LVS usage - https://phabricator.wikimedia.org/T97333#1239505 (10BBlack) 3NEW [20:16:05] (03CR) 10BryanDavis: [C: 031] "LGTM. Someone will need to keep an eye on disk usage on fluorine after." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206865 (https://phabricator.wikimedia.org/T88393) (owner: 10Anomie) [20:17:43] PROBLEM - Host oxygen is DOWN: CRITICAL - Host Unreachable (208.80.154.15) [20:18:15] (03PS1) 10Ottomata: Make oxygen use private instead of public IP. It will be reinstalled [dns] - 10https://gerrit.wikimedia.org/r/206939 (https://phabricator.wikimedia.org/T96616) [20:18:20] robh: ^ [20:18:30] ah, i should remove oxygen as is from icinga, sorry one sec [20:18:58] What is oxygen anyways? [20:19:10] it was a udp2log server [20:19:16] Oh okay [20:19:20] collecting webrequest access logs and filtering and sampling them out to files [20:19:23] something we used to breath, but apparently ottomata fixed that bug in human biology and rendered it obsolete [20:19:34] ottomata: you need to revoke all the keys, all puppetstoreddbconfig stuff too, etc... 
[20:19:37] naw, we still got it, we will just breathe it privately [20:19:39] that stuff isnt on there, lemme add [20:19:45] yes, robh, i did keys, but not puppetstoredconfig, just did [20:19:51] oh, then i wont add [20:19:52] heh [20:19:55] (CR) BryanDavis: Make scap localization cache build $TMPDIR aware (1 comment) [tools/scap] - https://gerrit.wikimedia.org/r/206856 (https://phabricator.wikimedia.org/T97257) (owner: Thcipriani) [20:20:18] i mean, you probably should, as it now says to schedule maintenance for the node [20:20:31] which, i think i did, but, i think i did it earlier and didn't set a long time [20:20:54] (CR) RobH: [C: 1] Make oxygen use private instead of public IP. It will be reinstalled [dns] - https://gerrit.wikimedia.org/r/206939 (https://phabricator.wikimedia.org/T96616) (owner: Ottomata) [20:21:14] looks good to me, feel free to merge and push live [20:21:35] k doing [20:21:43] (CR) Ottomata: [C: 2 V: 2] Make oxygen use private instead of public IP. It will be reinstalled [dns] - https://gerrit.wikimedia.org/r/206939 (https://phabricator.wikimedia.org/T96616) (owner: Ottomata) [20:21:47] oh need to set jessie in puppet too.. [20:23:43] (PS1) Ottomata: Reinstall oxygen as Jessie and .eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/206950 (https://phabricator.wikimedia.org/T96616) [20:23:52] robh, look right? [20:23:53] https://gerrit.wikimedia.org/r/#/c/206950/1/modules/install-server/files/dhcpd/linux-host-entries.ttyS1-115200 [20:24:24] (CR) RobH: [C: 1] Reinstall oxygen as Jessie and .eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/206950 (https://phabricator.wikimedia.org/T96616) (owner: Ottomata) [20:24:26] yep, [20:24:36] danke [20:24:36] though you dont have to include in site.pp at all for just standard, it doesnt hurt. 
[20:24:44] (03CR) 10Ottomata: [C: 032] Reinstall oxygen as Jessie and .eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/206950 (https://phabricator.wikimedia.org/T96616) (owner: 10Ottomata) [20:24:48] i see you just wanted to remove now invalid commenting =] [20:25:54] robh, ah, yes, but i also will add more things to that shortly [20:25:56] once it is up [20:27:43] hm, robh, oxygen is not in netboot.cfg, what happens? [20:28:12] no big deal, just add it to the proper stanza based on how you want it partitioned [20:28:21] its odd that its not there now, but random cleanups can remove items [20:28:33] yae [20:28:44] random cleanups, or extreme paranoia (if its not in netboot, it cannot auto partition) [20:28:44] was wondering what happens if a host is not listed there [20:28:52] it sits on the manual partitioning menu [20:29:09] and manual partitioning isnt allowed for production hosts! ;D [20:30:16] (03PS1) 10Ottomata: Use lvm.cfg for oxygen partman [puppet] - 10https://gerrit.wikimedia.org/r/206952 (https://phabricator.wikimedia.org/T96616) [20:30:16] ah [20:30:18] k [20:30:40] (03CR) 10Ottomata: [C: 032 V: 032] Use lvm.cfg for oxygen partman [puppet] - 10https://gerrit.wikimedia.org/r/206952 (https://phabricator.wikimedia.org/T96616) (owner: 10Ottomata) [20:32:49] (03CR) 10Andrew Bogott: [C: 032] Use @certname instead of certname in .erb [puppet] - 10https://gerrit.wikimedia.org/r/206721 (owner: 10Andrew Bogott) [20:36:39] !log deployed parsoid sha ebdac59b [20:36:45] Logged the message, Master [20:44:53] (03PS2) 10Andrew Bogott: @qualify a local .erb variable [puppet] - 10https://gerrit.wikimedia.org/r/206722 [20:44:55] (03PS2) 10Andrew Bogott: @qualify some .erb variable references. 
[puppet] - 10https://gerrit.wikimedia.org/r/206723 [20:47:18] (03CR) 10Andrew Bogott: [C: 032] @qualify a local .erb variable [puppet] - 10https://gerrit.wikimedia.org/r/206722 (owner: 10Andrew Bogott) [20:49:35] 6operations, 6Phabricator: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#1239647 (10chasemp) [20:49:52] 6operations, 6Phabricator: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#1239652 (10chasemp) 5stalled>3Open [20:50:16] 6operations, 6Phabricator: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#991959 (10chasemp) I setup phab-01 with the relevant settings and ran through and came up with a few of the aforementioned issues. Sorry this took me... [20:51:33] (03PS3) 10Andrew Bogott: @qualify some .erb variable references. [puppet] - 10https://gerrit.wikimedia.org/r/206723 [20:54:18] 10Ops-Access-Requests, 6operations: Add ebernhardson to 'stats' group for query access to eventlogging data on stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T97332#1239675 (10chasemp) I'm unsure of the right permissions wise thing to do, @ottomata would know best, but if it is indeed posix stats gro... [20:56:17] (03CR) 10Andrew Bogott: [C: 032] @qualify some .erb variable references. 
[puppet] - 10https://gerrit.wikimedia.org/r/206723 (owner: 10Andrew Bogott) [20:59:02] 6operations, 10Wikimedia-DNS: Set up new URL policy.wikimedia.org - https://phabricator.wikimedia.org/T97329#1239706 (10Aklapper) [21:03:48] 6operations, 6Phabricator: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#1239721 (10RobH) [21:05:37] 6operations, 10Wikimedia-DNS: Set up new URL policy.wikimedia.org - https://phabricator.wikimedia.org/T97329#1239741 (10Dzahn) Hi, can you let us know a little more how this is going to be used? Because the way we need to add it depends on that. After we add the name how would content get there? Would it just... [21:07:13] (03PS6) 10Andrew Bogott: puppetsigner: Clean up certs for instances we can't find in ldap [puppet] - 10https://gerrit.wikimedia.org/r/205897 [21:07:28] 6operations: Reinstall oxygen with Jessie - https://phabricator.wikimedia.org/T97331#1239743 (10Ottomata) [21:08:30] 6operations: Reinstall oxygen with Jessie - https://phabricator.wikimedia.org/T97331#1239478 (10Ottomata) I've done this: racadm config -g cfgServerInfo -o cfgServerBootOnce 1 racadm config -g cfgServerInfo -o cfgServerFirstBootDevice PXE racadm serveraction powercycle console com2 but am not getti... 
[21:08:44] robh: https://phabricator.wikimedia.org/T97331 [21:08:47] not sure what's up now :/ [21:09:30] hrmm [21:09:44] boot into bios and check the serial console redirection [21:09:55] and ensure it is set to com2 as the redirection port [21:09:59] and com1 as the external port [21:10:07] (or i can hop on and do that, but figured i'd tell ya how ;) [21:10:13] naw i'm in console now will try [21:10:23] of course, if tis all fucked up, you may not even be able to see console [21:10:28] but thats unusual for older systems [21:10:37] PROBLEM - puppet last run on mw2095 is CRITICAL puppet fail [21:10:45] wait, right, um [21:10:50] wait, so i should [21:10:56] -o cfgServerFirstBootDevice BIOS [21:10:59] and powercycle, ja? [21:11:02] then console com2 again? [21:12:15] same thing, btw [21:12:17] no output [21:12:25] i also can't seem to escape with the usual ^\ [21:12:37] oh [21:12:37] yes i can [21:13:05] hrmm... [21:13:06] ja no output though [21:13:14] so yea... sounds like its serial redirection is fucked [21:13:24] lemme take a quick look before we hand to onsite [21:13:27] ok cool [21:13:30] thanks [21:13:39] no need to disconnect entirely [21:13:44] operations, Wikimedia-SVG-rendering: Install (currently non-existing) Debian packages for PT (paratype) font on image scalars - https://phabricator.wikimedia.org/T97181#1239763 (Aklapper) Open>stalled [21:13:44] just dont be in console com2 [21:14:27] powercycling to bios, lets see if i have different result (i dont expect it to be) [21:15:08] k [21:15:29] ottomata: so, if this is indeed a borked setting, we'll need to make a sub-task/blocking task in ops-eqiad for chris to fix the settings [21:15:38] and, since its not posting, certainly seems to be [21:16:04] so you wanna make the task, stating you cannot get serial redirection console, and that the bios settings need to be checked/confirmed? [21:16:08] robh, ok. i think i hate subtasks, just curious, why subtasks? can't we just use the same ticket? 
i already have so many tickets and I'm having trouble tracking! [21:16:24] well, its messy to just assign different projects and hand offs [21:16:29] onsite work tasks should be sub tasks [21:16:32] i guess? ooook [21:16:34] it keeps the who does what a lot more clear [21:17:02] then you keep the main task, and when you see the subtask resolve, you know its ok to move on it [21:17:20] operations, ops-eqiad: Cannot get serial redirection console on oxygen - https://phabricator.wikimedia.org/T97339#1239775 (Ottomata) NEW [21:17:27] robh^ [21:18:07] i'd provide more context personally, but i like the onsites to know I tried to fix something myself before asking them [21:18:22] so i'd state that I attempted to reboot into bios, but console redirection isnt set, requiring onsite... etc.. [21:19:15] operations, ops-eqiad: Cannot get serial redirection console on oxygen - https://phabricator.wikimedia.org/T97339#1239789 (RobH) I also attempted to reboot this system into bios to check the settings, but it seems the bios redirection isn't set. This will need on-site crash-cart connection and manual set... 
[21:19:19] (03Abandoned) 10Jgreen: dmarc parser and database injector [puppet] - 10https://gerrit.wikimedia.org/r/185472 (owner: 10Jgreen) [21:19:21] (03PS5) 10Dzahn: sshd: use Chacha20-poly1305,AES-CGM ciphers [puppet] - 10https://gerrit.wikimedia.org/r/185325 [21:19:47] basically folks tend to say 'this is broken, fix' and i see a LOT of tasks in the past hit the onsite person [21:19:55] when it was a software fix that the opsen could have totally done without onsite [21:20:05] its why i tend to state in my tickets why i htink its onsite work ;D [21:20:30] (03PS1) 10Ori.livneh: varnish: enable gzip for mobile / text on labs [puppet] - 10https://gerrit.wikimedia.org/r/206960 [21:20:41] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 5 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#1239793 (10GWicke) @bblack, will Varnish consider Varnish-side backend request timeouts as equivalent to 5xx? [21:20:47] 6operations, 7Mail: set up DMARC aggregate report collection into a database for research and reporting - https://phabricator.wikimedia.org/T86209#1239795 (10Jgreen) 5Open>3stalled [21:20:58] 6operations, 7Mail: set up DMARC aggregate report collection into a database for research and reporting - https://phabricator.wikimedia.org/T86209#963609 (10Jgreen) p:5Normal>3Low [21:21:00] (03CR) 10Dzahn: [C: 032] "review was on PS4, this is only rebasing, checking on one example per distro release" [puppet] - 10https://gerrit.wikimedia.org/r/185325 (owner: 10Dzahn) [21:21:20] 6operations, 6Labs, 10Tool-Labs, 7Monitoring: Add catchall tests for toollabs to catchpoint - https://phabricator.wikimedia.org/T97321#1239797 (10yuvipanda) To begin with, how about we just have one for when toollabs home page is up and showing sensible things? 
[21:21:22] (03PS2) 10Ori.livneh: varnish: enable gzip for mobile / text on labs [puppet] - 10https://gerrit.wikimedia.org/r/206960 [21:21:41] (03PS3) 10Ori.livneh: varnish: enable gzip for mobile / text on labs [puppet] - 10https://gerrit.wikimedia.org/r/206960 [21:22:08] changes sshd config in base ... [21:22:14] (03CR) 10Ori.livneh: [C: 032 V: 032] "labs-only" [puppet] - 10https://gerrit.wikimedia.org/r/206960 (owner: 10Ori.livneh) [21:24:47] PROBLEM - puppet last run on osmium is CRITICAL Puppet last ran 1 day ago [21:25:52] I love it when I visit Special:Version from a random website and find that it runs on MW :) [21:26:06] osmium is me [21:27:58] RECOVERY - puppet last run on osmium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:28:29] 6operations, 10Wikimedia-DNS: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#1239834 (10Heather) 3NEW [21:28:48] RECOVERY - puppet last run on mw2095 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:31:08] (03PS2) 10BryanDavis: Make scap localization cache build $TMPDIR aware [tools/scap] - 10https://gerrit.wikimedia.org/r/206856 (https://phabricator.wikimedia.org/T97257) (owner: 10Thcipriani) [21:34:25] (03CR) 10Dzahn: "sodium: Ciphers aes256-ctr,aes192-ctr,aes128-ctr (lucid)" [puppet] - 10https://gerrit.wikimedia.org/r/185325 (owner: 10Dzahn) [21:35:46] 6operations, 6WMF-Legal, 10Wikimedia-General-or-Unknown, 7Documentation: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#1239859 (10hashar) Apache 2 is good to me. [21:36:49] (03CR) 10Jforrester: [C: 031] "Good to go in the SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/206502 (owner: 10GWicke) [21:37:05] (03CR) 10Yuvipanda: "lgtm - apparently other people (and puppetlint, boo) like arrows to be aligned - you can either do that, or go 'meh' and I'll merge it eit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/206118 (https://phabricator.wikimedia.org/T96898) (owner: 10Merlijn van Deen) [21:41:37] (03CR) 10Dzahn: "topic branch is called "(detached" and yea, aligning arrows would be nice" [puppet] - 10https://gerrit.wikimedia.org/r/206118 (https://phabricator.wikimedia.org/T96898) (owner: 10Merlijn van Deen) [21:42:24] (03CR) 10Yuvipanda: "The topic branch is ok, I think - very zen." [puppet] - 10https://gerrit.wikimedia.org/r/206118 (https://phabricator.wikimedia.org/T96898) (owner: 10Merlijn van Deen) [21:43:41] 6operations, 10Wikimedia-DNS: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#1239881 (10Dzahn) How about lawsuit.wikimedia.org, nsa.wikimedia.org or something similar? So something in .wikimedia.org vs. an entirely new domain name? [21:46:06] 6operations, 10Wikimedia-DNS: Set up new URL policy.wikimedia.org - https://phabricator.wikimedia.org/T97329#1239884 (10Yana) It will host a static html site, similar to the [[ https://transparency.wikimedia.org/ | transparency report ]] and the [[ https://annual.wikimedia.org/2014/ | WMF annual report ]] and... [21:46:27] (03PS9) 10Merlijn van Deen: Extend Exim diamond collector for Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/206118 (https://phabricator.wikimedia.org/T96898) [21:47:10] (03PS10) 10Yuvipanda: Extend Exim diamond collector for Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/206118 (https://phabricator.wikimedia.org/T96898) (owner: 10Merlijn van Deen) [21:47:13] 6operations, 10Wikimedia-DNS: Set up new URL policy.wikimedia.org - https://phabricator.wikimedia.org/T97329#1239889 (10Dzahn) @Yana Thanks! 
got it, we'll need to add it to DNS and also add some puppet code to setup a webserver to host it. i can do that and will upload patches. [21:47:17] valhallasw`cloud: alright, imma merge [21:47:28] (03CR) 10Yuvipanda: [C: 032 V: 032] Extend Exim diamond collector for Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/206118 (https://phabricator.wikimedia.org/T96898) (owner: 10Merlijn van Deen) [21:48:06] 6operations, 10Wikimedia-DNS: Set up new URL policy.wikimedia.org - https://phabricator.wikimedia.org/T97329#1239904 (10Yana) Excellent! Really appreciate your help. [21:49:13] (03PS2) 10Merlijn van Deen: Add local crontab monitoring [puppet] - 10https://gerrit.wikimedia.org/r/206833 (https://phabricator.wikimedia.org/T96472) [21:49:26] 10Ops-Access-Requests, 6operations: Add ebernhardson to 'stats' group for query access to eventlogging data on stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T97332#1239911 (10Dzahn) since this mentions "my.research.cnf " i would expect the researchers group to be the right one since that exists to g... [21:49:57] 10Ops-Access-Requests, 6operations: Add ebernhardson to 'stats' group for query access to eventlogging data on stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T97332#1239912 (10Dzahn) [21:50:20] valhallasw`cloud: aren’t you going to set some as administrative? [21:50:23] like puppet for example [21:50:33] YuviPanda: root and puppet are set as administrative by default [21:50:37] oh [21:50:39] in the .py [21:50:41] or the .pp [21:50:45] I forgot :{ [21:50:48] the .py I think [21:50:53] I missed that .update [21:50:55] in the pi [21:50:56] py [21:51:17] yeah, there it is [21:51:20] valhallasw`cloud: looks good to merge to me. should I merge? 
[21:51:24] we might need to add more later [21:51:25] sure [21:51:32] (03PS3) 10Yuvipanda: Add local crontab monitoring [puppet] - 10https://gerrit.wikimedia.org/r/206833 (https://phabricator.wikimedia.org/T96472) (owner: 10Merlijn van Deen) [21:51:38] should give a few hits on tools-dev, but nothing more I think [21:51:41] (03CR) 10Yuvipanda: [C: 032 V: 032] Add local crontab monitoring [puppet] - 10https://gerrit.wikimedia.org/r/206833 (https://phabricator.wikimedia.org/T96472) (owner: 10Merlijn van Deen) [21:51:43] (03PS1) 10Dzahn: add policy.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/206972 (https://phabricator.wikimedia.org/T97329) [21:51:51] and when we fix those, we can set an alert on > 1 [21:52:12] valhallasw`cloud: yeah [21:52:51] extendedexim is broken, it seems [21:52:52] bah. [21:53:10] exim returning non-expected lines? weird [21:53:30] valhallasw`cloud: heh, > IndexError: list index out of range [21:53:33] looksl ike [21:54:41] (03PS1) 10Dzahn: varnish: add misc-web config for policy.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/206974 (https://phabricator.wikimedia.org/T97329) [21:58:16] YuviPanda: I don't have time to debug it now, though :( we can either keep it this way (which doesn't really harm anyone other than filling the diamond log), or we can roll back [21:58:26] fscking diamond making debugging impossible [21:58:40] valhallasw`cloud: hmm, in that case let me roll back. 
[21:58:49] (y) [21:59:18] (03PS1) 10Yuvipanda: Revert "Extend Exim diamond collector for Tool Labs" [puppet] - 10https://gerrit.wikimedia.org/r/206976 [21:59:25] (03PS2) 10Yuvipanda: Revert "Extend Exim diamond collector for Tool Labs" [puppet] - 10https://gerrit.wikimedia.org/r/206976 [21:59:33] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "Extend Exim diamond collector for Tool Labs" [puppet] - 10https://gerrit.wikimedia.org/r/206976 (owner: 10Yuvipanda) [21:59:56] valhallasw`cloud: done [21:59:58] (03PS1) 10Dzahn: policy.wm.org: minimal module/role for microsite [puppet] - 10https://gerrit.wikimedia.org/r/206978 (https://bugzilla.wikimedia.org/97329) [22:00:00] (03PS1) 10Chmarkine: RT - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206977 (https://phabricator.wikimedia.org/T40516) [22:00:47] YuviPanda: basically can't get diamond to show any debug info in ~/eximparser [22:00:52] YuviPanda: https://phabricator.wikimedia.org/T97334#1239591 #2 seems like an interesting initiative do you have a task assigned for it? [22:00:57] [2015-04-27 22:00:23,871] [MainThread] Initialized Collector: ExtendedEximCollector [22:00:57] [2015-04-27 22:00:23,871] [MainThread] Skipped loading disabled Collector: ExtendedEximCollector [22:01:58] valhallasw`cloud: you can just live add / edit it on tools-mail and use log to debug [22:02:10] that feels icky :-p [22:02:11] (03PS2) 10Dzahn: policy.wm.org: minimal module/role for microsite [puppet] - 10https://gerrit.wikimedia.org/r/206978 (https://phabricator.wikimedia.org/T97329) [22:02:12] Negative24: no, but we can create one if you want :D [22:02:32] YuviPanda: also I have no clue how to add it manually except via puppet atm [22:02:32] Negative24: it should be really interesting. imagine letting each user run on a container that only lets them ssh to somewhere else and nothing else [22:02:35] because diamond [22:02:41] is an incomprehensible mess [22:03:14] valhallasw`cloud: ah. 
it just is a config file edited somewhere. look at the diamond define - it just adds a file to /etc/diamond/collectors/${collector}.conf with appropriate values [22:03:29] yeah, that's the theory [22:03:33] YuviPanda: seems worthwhile even if that bug is declined. I've seen many users try and use bastion as a sort of work machine which they shouldn't and giving them an error would prevent issues. [22:03:39] of course diamond has a habit of randomly ignoring stuff :P [22:03:41] Negative24: yeah. [22:03:51] YuviPanda: container? [22:03:55] but I have to go to bed now [22:04:09] Negative24: well, or cgroups, or some other form of controlling that :) [22:04:10] valhallasw`cloud: <3 [22:08:47] YuviPanda: I just realized that reverting puppet doesnt actually remove the config file... [22:08:54] haha [22:08:56] ofc [22:09:05] YuviPanda: but ill take a look at it tomorrow [22:09:12] yeah that’s totally fine [22:09:35] (PS1) Chmarkine: donate - Raise HSTS max-age to 1 year [puppet] - https://gerrit.wikimedia.org/r/206979 (https://phabricator.wikimedia.org/T40516) [22:09:39] Negative24: we could even set people to use a constrained shell that didn’t allow anything other than sshing to elsewhere :) [22:10:11] YuviPanda: that's what I was thinking of [22:10:22] that would be the easiest solution [22:10:23] yeah [22:10:48] Negative24: and have everyone not ops default to that... [22:11:00] well, ops / admins of bastion project [22:11:54] yep [22:12:26] do you have an ldap ops group? [22:12:42] yeah, I think so [22:12:43] Ops-Access-Requests, operations: Give Google webmaster tools access to jon katz (Read only is fine) - https://phabricator.wikimedia.org/T90980#1239981 (Dzahn) Open>Resolved a:Dzahn talked to james. already done. 
(we shared the global login and there were more than just 20 to be added) [22:12:56] Negative24: actually, just the project owners for bastion should be enough [22:13:39] 10Ops-Access-Requests, 6operations: Add ebernhardson to 'stats' group for query access to eventlogging data on stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T97332#1239988 (10Dzahn) p:5Triage>3Normal [22:14:11] YuviPanda: I can't think of another reason why ops would need all access to bastion if they aren't apart of bastion's admin group as well. They can't sudo [22:14:19] yeah. [22:14:35] so basically limiting everyone except people in projectadmin group to a silly shell should be good enough [22:15:00] yes but also keep git [22:15:11] (03PS1) 10Chmarkine: doc - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206980 (https://phabricator.wikimedia.org/T40516) [22:15:19] that's how I sync my dotfiles which include ssh configs [22:15:53] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/206515 (owner: 10Aklapper) [22:16:03] but we should also email around labs for other programs that need executing [22:17:14] Negative24: true. [22:19:13] (03PS1) 10Chmarkine: integration - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206981 (https://phabricator.wikimedia.org/T40516) [22:20:17] (03CR) 1020after4: [C: 031] Make scap localization cache build $TMPDIR aware [tools/scap] - 10https://gerrit.wikimedia.org/r/206856 (https://phabricator.wikimedia.org/T97257) (owner: 10Thcipriani) [22:23:12] (03CR) 10Dzahn: [C: 032] Phab monthly stats email: Clarify what day values for priority mean [puppet] - 10https://gerrit.wikimedia.org/r/206515 (owner: 10Aklapper) [22:23:49] (03PS2) 10Dzahn: Phab monthly stats email: Show how many projects saw workboard moves [puppet] - 10https://gerrit.wikimedia.org/r/206518 (owner: 10Aklapper) [22:24:20] (03CR) 10Dzahn: [C: 032] "tested sql on m3-master. 
the result is 173" [puppet] - 10https://gerrit.wikimedia.org/r/206518 (owner: 10Aklapper) [22:25:28] (03PS1) 10Chmarkine: servermon - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206982 (https://phabricator.wikimedia.org/T40516) [22:29:37] (03PS1) 10Chmarkine: iegreview - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206983 (https://phabricator.wikimedia.org/T40516) [22:31:30] (03PS2) 10Dzahn: Update dispatchChanges cronjob to use new script location [puppet] - 10https://gerrit.wikimedia.org/r/205644 (owner: 10Aude) [22:31:59] PROBLEM - DPKG on cp1069 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:32:18] PROBLEM - DPKG on cp1070 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:32:18] PROBLEM - DPKG on cp1072 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:32:28] PROBLEM - DPKG on cp1071 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:32:32] bblack: ^ [22:32:38] PROBLEM - DPKG on cp1073 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:32:39] mutante: thanks a lot for merging my stuff! [22:32:55] andre__: you're welcome, np [22:33:08] PROBLEM - dhclient process on analytics1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:33:08] PROBLEM - Hadoop NodeManager on analytics1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:33:08] PROBLEM - DPKG on cp1074 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:33:58] PROBLEM - RAID on analytics1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:33:59] PROBLEM - configured eth on analytics1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:34:05] hm [22:34:06] uh oh [22:34:08] PROBLEM - Disk space on analytics1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:34:08] PROBLEM - SSH on analytics1016 is CRITICAL - Socket timeout after 10 seconds [22:34:09] PROBLEM - salt-minion processes on analytics1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:34:27] PROBLEM - puppet last run on analytics1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:34:28] PROBLEM - Hadoop DataNode on analytics1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:34:38] PROBLEM - DPKG on analytics1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:35:47] (03PS1) 10Chmarkine: annual - Raise HSTS max-age to 1 year and add "always" [puppet] - 10https://gerrit.wikimedia.org/r/206984 (https://phabricator.wikimedia.org/T599) [22:36:18] (03CR) 10Hoo man: [C: 031] "Ok to be merged." [puppet] - 10https://gerrit.wikimedia.org/r/205644 (owner: 10Aude) [22:36:33] !log powercycled analytics1016 after it is unreachable. [22:36:40] Logged the message, Master [22:37:39] PROBLEM - DPKG on cp1062 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:37:51] (03CR) 10Dzahn: [C: 032] "the old location is already:" [puppet] - 10https://gerrit.wikimedia.org/r/205644 (owner: 10Aude) [22:38:28] PROBLEM - DPKG on cp1059 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:38:28] PROBLEM - DPKG on cp1060 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:38:39] PROBLEM - DPKG on cp1057 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:38:48] PROBLEM - DPKG on cp1061 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:40:17] PROBLEM - Host analytics1016 is DOWN: PING CRITICAL - Packet loss = 100% [22:40:29] ^ blame "mlocate" [22:42:48] (03CR) 10Dzahn: "applied on terbium" [puppet] - 10https://gerrit.wikimedia.org/r/205644 (owner: 10Aude) [22:43:29] !log racreset on analytics1016 because no console [22:43:35] Logged the message, Master [22:43:47] PROBLEM - DPKG on cp1051 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:43:57] PROBLEM - DPKG on cp1054 is 
CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:44:17] PROBLEM - DPKG on cp1053 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:44:17] PROBLEM - DPKG on cp1056 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:44:20] ottomata: actually died it seems :/ [22:44:24] ottomata: ana1016 [22:44:25] yeah [22:44:30] can't even log into console now [22:44:43] yea, so i could but "connect com2" did nothing [22:44:43] console com2 just returns immediately [22:44:45] yeah [22:44:46] so i reset drac [22:44:47] PROBLEM - DPKG on cp1052 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:44:54] did that help? [22:44:55] now i could connect, but no output [22:44:58] hm [22:45:02] so that's me. I don't know how to mass-disable a dpkg alert though :P [22:45:05] it changed but it doesnt mean i see it doing anything :p [22:45:11] hm, ok. mutante, will file ticket [22:45:18] PROBLEM - DPKG on cp1055 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:46:19] 6operations, 10ops-eqiad: analytics1016 down - https://phabricator.wikimedia.org/T97349#1240076 (10Ottomata) 3NEW a:3Cmjohnson [22:46:24] mutante: ^ [22:46:45] (03PS2) 10Dereckson: Enable SandboxLink on es.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206735 [22:46:49] ACKNOWLEDGEMENT - DPKG on cp1046 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn . [22:46:49] ACKNOWLEDGEMENT - DPKG on cp1051 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn . [22:46:49] ACKNOWLEDGEMENT - DPKG on cp1052 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn . [22:46:49] ACKNOWLEDGEMENT - DPKG on cp1053 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn . [22:46:49] ACKNOWLEDGEMENT - DPKG on cp1054 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn . [22:46:50] ACKNOWLEDGEMENT - DPKG on cp1055 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn . 
[22:46:50] ACKNOWLEDGEMENT - DPKG on cp1056 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn .
[22:46:51] ACKNOWLEDGEMENT - DPKG on cp1057 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn .
[22:46:51] ACKNOWLEDGEMENT - DPKG on cp1059 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn .
[22:46:52] ACKNOWLEDGEMENT - DPKG on cp1060 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn .
[22:46:52] ACKNOWLEDGEMENT - DPKG on cp1061 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn .
[22:46:53] bblack: done
[22:46:53] ACKNOWLEDGEMENT - DPKG on cp1062 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn .
[22:46:53] ACKNOWLEDGEMENT - DPKG on cp1069 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn .
[22:46:54] ACKNOWLEDGEMENT - DPKG on cp1070 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn .
[22:46:55] ottomata: cool
[22:47:04] (PS3) Dereckson: Enable SandboxLink on es.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/206735 (https://phabricator.wikimedia.org/T97135)
[22:47:38] mutante: where do you that? :)
[22:47:39] ACKNOWLEDGEMENT - Host analytics1016 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T97349#1240076
[22:47:57] (CR) Dereckson: "PS2: Rebased" [mediawiki-config] - https://gerrit.wikimedia.org/r/206735 (https://phabricator.wikimedia.org/T97135) (owner: Dereckson)
[22:48:04] Ops-Access-Requests, operations: Add ebernhardson to 'stats' group for query access to eventlogging data on stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T97332#1240093 (Ottomata) Yes, he needs to be in group researchers, for access to the file /etc/mysql/conf.d/research-client.cnf
[22:48:50] bblack: in the web ui, i went to the list of criticals and there is a checkbox in the very first row, and when i check that it selects ALL below
[22:49:16] oh I meant disable them from happening in the first place, sorry
[22:49:18] RECOVERY - DPKG on cp1071 is OK: All packages OK
[22:49:19] but thanks :)
[22:49:28] RECOVERY - DPKG on cp1073 is OK: All packages OK
[22:49:29] RECOVERY - DPKG on cp1062 is OK: All packages OK
[22:49:48] RECOVERY - DPKG on cp1052 is OK: All packages OK
[22:49:58] RECOVERY - DPKG on cp1074 is OK: All packages OK
[22:50:18] RECOVERY - DPKG on cp1059 is OK: All packages OK
[22:50:27] RECOVERY - DPKG on cp1060 is OK: All packages OK
[22:50:28] RECOVERY - DPKG on cp1055 is OK: All packages OK
[22:50:28] RECOVERY - DPKG on cp1069 is OK: All packages OK
[22:50:28] RECOVERY - DPKG on cp1051 is OK: All packages OK
[22:50:37] RECOVERY - DPKG on cp1057 is OK: All packages OK
[22:50:47] RECOVERY - DPKG on cp1061 is OK: All packages OK
[22:50:47] RECOVERY - DPKG on cp1054 is OK: All packages OK
[22:50:48] RECOVERY - DPKG on cp1070 is OK: All packages OK
[22:50:48] RECOVERY - DPKG on cp1072 is OK: All packages OK
[22:50:58] RECOVERY - DPKG on cp1053 is OK: All packages OK
[22:50:58] RECOVERY - DPKG on cp1056 is OK: All packages OK
[22:51:43] operations, RESTBase, hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1240105 (GWicke) @RobH, to evaluate the main options (single instance vs. multi-instance), could you get rough price estimates for these options? For comparison purposes, they pretend t...
[22:52:32] operations, Analytics, Traffic: Fix annoying varnishncsa+initsystem issues on jessie - https://phabricator.wikimedia.org/T97351#1240106 (BBlack) NEW
[22:53:41] hopefully the rest won't break, but I never have much luck doing package updates and not managing to trigger some pointless icinga alerts heh
[22:53:58] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[22:54:27] bblack: via the hostgroup page. Hostgroup Overview -> cache_upload_eqiad -> "view status summary for this host group" -> "unhandled" -> use that checkbox on top
[22:55:20] ah, nevermind, that also doesn't tell you how to do it before, then they are not "unhandled"
[22:55:32] so it's actually "add more servicegroups" i guess
[22:55:38] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[22:55:46] what I want is the equivalent of planned-downtime on a host, but instead on a service across many hosts
[22:55:51] because then you can mass-handle entire groups
[22:57:13] yea, you can search for a string, like just "dpkg" and then use the result screen, but that only works if this check is unique
[22:57:29] ah yeah
[22:57:31] i think it would have to be "add service group for "dpkg on cp"
[22:58:06] or that shell script and bash
[22:58:24] or icinga could support boolean search queries on specific attributes, but that sounds like crazy 1980's technology they'll never catch up to :)
[23:00:04] RoanKattouw, ^d, matt_flaschen, gwicke: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150427T2300).
[23:00:11] bblack: do it manually for one of them and watch what that adds to /var/lib/nagios3/rw/nagios.cmd . then just write directly into it with a script replacing the hostname
[23:01:02] hrmm.. i know :) bbl, gotta catch bart
[23:02:52] 'evening
[23:07:48] PROBLEM - puppet last run on mw2118 is CRITICAL puppet fail
[23:08:05] (CR) 20after4: "who can we get to review this one?" [puppet] - https://gerrit.wikimedia.org/r/201344 (https://phabricator.wikimedia.org/T94754) (owner: BryanDavis)
[23:08:13] (PS1) Chmarkine: ishmael - Raise HSTS max-age to 1 year and add "always" [puppet] - https://gerrit.wikimedia.org/r/206992 (https://phabricator.wikimedia.org/T40516)
[23:14:06] (CR) BBlack: "does trebuchet always use group-level perms (it's never more correct to use 022 instead 002?)" [puppet] - https://gerrit.wikimedia.org/r/201344 (https://phabricator.wikimedia.org/T94754) (owner: BryanDavis)
[23:18:38] PROBLEM - Host mw2072 is DOWN: PING CRITICAL - Packet loss = 100%
[23:19:22] (CR) BryanDavis: "@BBlack: the problem we are attempting to solve here is that when trebuchet runs checkout commands on the deployment server (tin) it curre" [puppet] - https://gerrit.wikimedia.org/r/201344 (https://phabricator.wikimedia.org/T94754) (owner: BryanDavis)
[23:20:18] RECOVERY - Host mw2072 is UP: PING OK - Packet loss = 0%, RTA = 43.01 ms
[23:21:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[23:21:55] Who is SWATing?
[23:22:33] matt_flaschen, RoanKattouw, gwicke?
[23:22:53] James_F, RoanKattouw is, he'll BRB
[23:23:02] matt_flaschen: He denies it.
[23:23:15] I didn't denied
[23:23:22] Deny it
[23:23:24] Whatever
[23:23:25] SWAT time!
[23:23:40] !wmbot Duh-nuh Yes I am doing it.
[23:23:48] Eh, whatever.
[23:24:06] Hi RoanKattouw.
[23:24:21] Hey Dereckson
[23:24:24] I'll do your config patches first
[23:24:29] RECOVERY - puppet last run on mw2118 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures
[23:25:01] Fine, I've my test plan ready at http://etherpad.wikimedia.org/p/deploy-20150427-SWAT-evening.
[23:25:52] (CR) BBlack: "I get that. Our default umask is 022, which prevents group/other write, and what you're putting here is 002, which only prevents other-wr" [puppet] - https://gerrit.wikimedia.org/r/201344 (https://phabricator.wikimedia.org/T94754) (owner: BryanDavis)
[23:25:59] (CR) Catrope: [C: 2] Enable NewUserMessage on ne.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/206733 (https://phabricator.wikimedia.org/T96823) (owner: Dereckson)
[23:26:02] (CR) Catrope: [C: 2] Enable SandboxLink on es.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/206735 (https://phabricator.wikimedia.org/T97135) (owner: Dereckson)
[23:26:06] (Merged) jenkins-bot: Enable NewUserMessage on ne.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/206733 (https://phabricator.wikimedia.org/T96823) (owner: Dereckson)
[23:26:08] (CR) Catrope: [C: 2] Removed autoreview group on fr.wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/206848 (https://phabricator.wikimedia.org/T90979) (owner: Dereckson)
[23:26:12] (CR) Catrope: [C: 2] Content namespaces on fr.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/206719 (https://phabricator.wikimedia.org/T97228) (owner: Dereckson)
[23:26:39] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0]
[23:26:53] (Merged) jenkins-bot: Enable SandboxLink on es.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/206735 (https://phabricator.wikimedia.org/T97135) (owner: Dereckson)
[23:26:59] (Merged) jenkins-bot: Removed autoreview group on fr.wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/206848 (https://phabricator.wikimedia.org/T90979) (owner: Dereckson)
[23:27:03] (Merged) jenkins-bot: Content namespaces on fr.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/206719 (https://phabricator.wikimedia.org/T97228) (owner: Dereckson)
[23:28:22] !log catrope Synchronized wmf-config/flaggedrevs.php: Remove autoreview group on frwikinews (duration: 00m 35s)
[23:28:27] Logged the message, Master
[23:28:48] bblack: ah. On the trebuchet target hosts (eg mw*) the ownership is root:root. In theory we only care about this group permission support on the deploy server. In practice I'm not sure if adding the complexity to switch umask by host/role is worth the trouble. Input obviously welcome however.
[23:28:58] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[23:29:00] !log catrope Synchronized wmf-config/InitialiseSettings.php: SWAT (duration: 00m 26s)
[23:29:03] Logged the message, Master
[23:29:21] The use of 022 was already in some places in the code, but not all
[23:29:27] 206848 ok
[23:29:47] operations, RESTBase, hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1240298 (RobH) update from irc: The rt ticket has another variant quote already. I've requested that Gabriel sync up with Filippo on a hardware specification discussion before I keep r...
[23:30:21] 206735 ok
[23:30:23] well, you're taking a risk there I guess. instead of needing to break uid=0 to modify, they only have to break gid=0, which might be easier in some hypothetical future exploit? I donno.
[23:30:37] bblack: true enough
[23:30:39] bd808: if it's relatively easy to only use the group-write stuff where it matters, that would be nice.
[23:30:48] gwicke: You around for your config patch?
[23:31:14] if it means 5,000 line refactor of puppet stuff just to make the distinction, maybe not :)
[23:31:25] bblack: the "right" thing to do might really be to make the provider=>trebuchet magic not run checkouts on the deploy server
[23:31:46] RoanKattouw: for checking, yes
[23:32:03] OK, will deploy it now
[23:32:11] Anyone know why https://wikitech.wikimedia.org/wiki/Incident_documentation/20150406-Flow isn't showing up at https://wikitech.wikimedia.org/wiki/Incident_documentation ?
[23:32:14] bblack: That's where the breakage starts. Ryan didn't anticipate running a trebuchet checkout on the trebuchet master server from what I can tell
[23:32:18] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0]
[23:32:28] (CR) Catrope: [C: 2] Use /api/rest_v1/ entry point for VE, take two. [mediawiki-config] - https://gerrit.wikimedia.org/r/206502 (owner: GWicke)
[23:32:34] (Merged) jenkins-bot: Use /api/rest_v1/ entry point for VE, take two. [mediawiki-config] - https://gerrit.wikimedia.org/r/206502 (owner: GWicke)
[23:32:48] well, I know nothing about trebuchet, and I'm still hoping I can escape life without ever having to
[23:33:28] !log catrope Synchronized wmf-config/CommonSettings.php: Re-enable same-domain RESTbase entry point for VE (duration: 00m 22s)
[23:33:33] Logged the message, Master
[23:33:36] gwicke: Done
[23:34:44] RoanKattouw: looking good
[23:34:49] https://www.mediawiki.org/wiki/VisualEditor/Design?veaction=edit
[23:35:25] gwicke: does this mean we can now get some stats over the next day or two to compare vs before on moving the endpoints out?
[23:35:28] RoanKattouw: Deployment looks good to me.
[23:35:48] bblack: Not much load from MediaWiki.org sadly.
[23:35:52] bblack: not yet, no
[23:35:57] oh, not the right "group"
[23:35:59] tto: You around for your SWAT patch?
[23:35:59] bblack: Next up is to enable for other wikis.
[23:36:00] And matt_flaschen
[23:36:08] ok
[23:36:12] Here
[23:36:23] RoanKattouw: Just realised it is not needed and removed it from the SWAT list; sorry for the hassle
[23:36:28] OK thanks
[23:36:31] Less work for me :)
[23:38:22] 206733 ok
[23:39:28] and 206719 ok.
[23:39:59] Thanks for the deploy Roan.
[23:43:15] (CR) BryanDavis: "The other way to attack this problem would be to ensure that Trebuchet doesn't try to fetch and checkout on the deployment server itself. " [puppet] - https://gerrit.wikimedia.org/r/201344 (https://phabricator.wikimedia.org/T94754) (owner: BryanDavis)
[23:43:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 3 below the confidence bounds
[23:51:19] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[23:53:57] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0]
[23:54:38] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0]
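
Note on the 23:00:11 suggestion (writing straight into /var/lib/nagios3/rw/nagios.cmd with a script that swaps the hostname): a minimal sketch of what such a script might look like, not what anyone in the channel actually ran. It assumes the standard Nagios/Icinga external command SCHEDULE_SVC_DOWNTIME and the service description "DPKG" as it appears in the alerts above; the function names, the author string and the comment text are made up for illustration.

```python
import time

# Command file path as quoted in the log; other installs may use a different path.
COMMAND_FILE = "/var/lib/nagios3/rw/nagios.cmd"


def svc_downtime_lines(hosts, service, duration_s, author, comment, now=None):
    """Build one SCHEDULE_SVC_DOWNTIME external command line per host.

    Field order: host;service;start;end;fixed;trigger_id;duration;author;comment
    """
    start = int(time.time()) if now is None else int(now)
    end = start + duration_s
    # fixed=1 (downtime runs exactly from start to end), trigger_id=0 (no trigger)
    return [
        f"[{start}] SCHEDULE_SVC_DOWNTIME;{host};{service};"
        f"{start};{end};1;0;{duration_s};{author};{comment}"
        for host in hosts
    ]


def submit(lines, command_file=COMMAND_FILE):
    """Write the command lines to the command file (normally a named pipe)."""
    with open(command_file, "w") as pipe:
        for line in lines:
            pipe.write(line + "\n")


if __name__ == "__main__":
    # Hypothetical host list; print instead of submitting so nothing is changed.
    hosts = [f"cp{n}" for n in range(1051, 1057)]
    for line in svc_downtime_lines(hosts, "DPKG", 3600, "example_user", "package updates"):
        print(line)
```

Scheduling downtime before the package update is what bblack asked for at 22:55:46 (a service across many hosts); the same loop with ACKNOWLEDGE_SVC_PROBLEM would instead mirror the bulk acknowledgement mutante did in the web UI.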
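
Note on the umask thread (022 vs 002, 23:14:06 onward): a small worked example of the arithmetic being discussed, assuming the usual POSIX rule that the resulting mode is the requested mode with the umask bits cleared. The helper names are illustrative only and have nothing to do with the Trebuchet or puppet code under review.

```python
import stat


def effective_mode(requested, umask):
    """Permission bits a new file or directory ends up with under a given umask."""
    return requested & ~umask


def describe(mode):
    """Render permission bits for a regular file the way `ls -l` would."""
    return stat.filemode(stat.S_IFREG | mode)


if __name__ == "__main__":
    for umask in (0o022, 0o002):
        file_mode = effective_mode(0o666, umask)  # typical request for new files
        dir_mode = effective_mode(0o777, umask)   # typical request for new directories
        print(f"umask {umask:03o}: files {file_mode:03o} {describe(file_mode)}, dirs {dir_mode:03o}")
```

With 022 a new file comes out as 644 (no group write); with 002 it comes out as 664, which is the group-write behaviour BryanDavis wants on the deploy server and the extra exposure bblack points out for root:root checkouts on the target hosts.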