[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161111T0000). Please do the needful. [00:00:05] RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:44] (03CR) 10Reedy: "Pretty much. Want to test the script manually on beta first to check the internals still work... Not sure they've been run for a while!" [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [00:01:29] My patch is the only one, so I'll do the SWAT [00:01:45] (03PS2) 10Catrope: Enable {{NOINDEX}} as a noindex template on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319348 (https://phabricator.wikimedia.org/T149538) [00:02:13] (03CR) 10Catrope: [C: 032] Enable {{NOINDEX}} as a noindex template on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319348 (https://phabricator.wikimedia.org/T149538) (owner: 10Catrope) [00:02:45] (03Merged) 10jenkins-bot: Enable {{NOINDEX}} as a noindex template on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319348 (https://phabricator.wikimedia.org/T149538) (owner: 10Catrope) [00:03:27] (03CR) 10Legoktm: "PageTriage has switched to extension.json, so there's no need for $wg = $wmg anymore." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/319348 (https://phabricator.wikimedia.org/T149538) (owner: 10Catrope) [00:05:14] RECOVERY - Disk space on elastic1024 is OK: DISK OK [00:07:46] ^ checked with discovery, that was a reindex, it needs a lot more disk but only temp [00:09:04] RoanKattouw: You should be able to test it with the article https://en.wikipedia.org/wiki/Youssif_Isa [00:10:03] Thanks man [00:10:09] RoanKattouw: It currently doesn't have a noindex tag, but it should after the change. [00:10:54] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:10:58] (03CR) 10Catrope: "Good point. There were already two wmg's there, so I assumed I needed one too. I'll clean them all up in one go afterwards." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319348 (https://phabricator.wikimedia.org/T149538) (owner: 10Catrope) [00:12:30] (03PS2) 10Kaldari: Removing registered trademark symbol from footer of Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320865 [00:12:36] yay, it works on mw1099 [00:13:24] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:13:47] kaldari: You want that trademark one to ride along too? 
[00:13:50] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Enable {{NOINDEX}} as a noindex template on enwiki (1/2) (T149538) (duration: 00m 49s) [00:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:57] T149538: Noindex template feature should be restricted to new articles - https://phabricator.wikimedia.org/T149538 [00:14:04] RoanKattouw: Oh, sure [00:14:36] (03PS3) 10Kaldari: Removing registered trademark symbol from footer of Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320865 (https://phabricator.wikimedia.org/T95007) [00:15:04] (03CR) 10Catrope: [C: 032] Removing registered trademark symbol from footer of Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320865 (https://phabricator.wikimedia.org/T95007) (owner: 10Kaldari) [00:15:34] (03Merged) 10jenkins-bot: Removing registered trademark symbol from footer of Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320865 (https://phabricator.wikimedia.org/T95007) (owner: 10Kaldari) [00:15:39] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable {{NOINDEX}} as a noindex template on enwiki (2/2) (T149538) (duration: 00m 47s) [00:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:22] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Remove registered trademark symbol from officewiki footer (T95007) (duration: 00m 48s) [00:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:28] T95007: Improve trademark code in MobileFrontend - https://phabricator.wikimedia.org/T95007 [00:23:07] !log swift eqiad-prod: ms-be1027 to weight 1000 - T136631 [00:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:14] T136631: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631 [00:24:23] 06Operations, 10ops-eqiad, 10media-storage, 13Patch-For-Review: rack/setup/deploy 
ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2341913 (10fgiunchedi) [00:24:25] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2787963 (10fgiunchedi) 05Open>03Resolved thanks @Cmjohnson for taking care of this! LGTM now, will progressively put the machine in service in {T136631} [00:24:43] 06Operations, 10ops-eqiad, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2341913 (10fgiunchedi) a:05Cmjohnson>03fgiunchedi [00:24:54] PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:32:25] ACKNOWLEDGEMENT - MD RAID on ms-be1027 is CRITICAL: CRITICAL: Active: 5, Working: 5, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T150498 [00:32:28] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1027 - https://phabricator.wikimedia.org/T150498#2787976 (10ops-monitoring-bot) [00:36:39] wah wah waaaaahhh [00:37:03] :) [00:37:16] you mean the auto-ack, right [00:39:41] no the fact that the host failed _again_ [00:42:24] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [00:45:44] 06Operations, 10ops-eqiad, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2788018 (10fgiunchedi) [00:45:47] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2788015 (10fgiunchedi) 05Resolved>03Open I spoke way too soon, machine still reports failures on SSDs as in P4409 :( Looks to me like it might be just DOA?
[00:48:25] godog: owww...ok [00:49:35] ACKNOWLEDGEMENT - HP RAID on ms-be1027 is CRITICAL: CRITICAL: Slot 3: Failed: 2I:4:1, 2I:4:2 - OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T150500 [00:49:38] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1027 - https://phabricator.wikimedia.org/T150500#2788019 (10ops-monitoring-bot) [00:53:22] 06Operations, 13Patch-For-Review: Firewall sets not being loaded post-reboot due to a @resolve race - https://phabricator.wikimedia.org/T148986#2738609 (10greg) Obligatory UBN! priority check-in after 2.5 weeks. Is that prio still valid? Should this be prioritized within some team more highly? There's a relate... [00:53:36] godog: That is awesome news!!! Happy that is over [00:53:54] RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [00:54:28] so, is it over or is it a second fail? [00:55:03] the latter [00:55:09] cmjohnson: not over :((((( [00:55:30] .... [00:56:16] that server is going to be the death of me [00:58:57] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2788085 (10Liuxinyu970226) [00:59:42] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: Icinga should alert on free disk space < 15% on Elasticsearch hosts - https://phabricator.wikimedia.org/T130329#2788089 (10Dzahn) 05Resolved>03Open < ebernhardson> mutante: thanks for the ping, but in general you don't have to worry about... [01:01:05] seriously [01:01:54] maybe not worth it.. hardware donation to other non-profit ?
[01:03:37] nah it is under warranty heh [01:04:02] i've replaced just about everything...guess now I need another disk [01:04:30] !log revert swift ring change for ms-be1027 [01:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:11] (03PS1) 10BBlack: Test write buffer size theory for extra RTT [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320939 [01:05:11] cmjohnson: sigh, including replacing the controller? [01:05:13] (03PS1) 10BBlack: nginx (1.11.4-1+wmf15) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320940 [01:05:27] I hate HP [01:05:34] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [01:06:34] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [01:09:10] cmjohnson: I have to run, the machine is now in icinga though so it'll alarm if you take it down, it is otherwise in your hands [01:09:42] okay...thx for letting me know. I will take a hammer to it in the morning! ;-) [01:11:00] (03PS1) 10Dduvall: [WIP] contint: New role for Docker based CI slave [puppet] - 10https://gerrit.wikimedia.org/r/320942 (https://phabricator.wikimedia.org/T150502) [01:13:51] (03PS1) 10Dzahn: mgmt: fix typos in getmgmtips script [puppet] - 10https://gerrit.wikimedia.org/r/320943 [01:16:13] (03PS2) 10Dzahn: mgmt: fix typos in getmgmtips script [puppet] - 10https://gerrit.wikimedia.org/r/320943 [01:16:34] (03CR) 10Dzahn: [C: 032] mgmt: fix typos in getmgmtips script [puppet] - 10https://gerrit.wikimedia.org/r/320943 (owner: 10Dzahn) [01:17:04] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:25:06] PROBLEM - Disk space on ms-be1027 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdb3 is not accessible: Input/output error [01:25:06] PROBLEM - swift-container-replicator on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [01:25:07] PROBLEM - swift-account-auditor on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [01:25:07] PROBLEM - swift-object-server on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [01:25:14] PROBLEM - swift-account-reaper on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [01:25:24] PROBLEM - swift-object-updater on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [01:25:24] PROBLEM - swift-container-server on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [01:25:24] PROBLEM - swift-account-replicator on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [01:25:24] PROBLEM - swift-container-updater on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [01:25:34] PROBLEM - swift-account-server on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [01:25:44] PROBLEM - swift-object-auditor on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [01:25:44] PROBLEM - swift-container-auditor on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [01:25:54] PROBLEM - 
swift-object-replicator on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [01:29:20] ummm godog are you working on this ^ [01:30:48] (03PS2) 10BBlack: Test another write buffer size theory for extra RTT [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320939 [01:30:50] (03PS2) 10BBlack: nginx (1.11.4-1+wmf15) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320940 [01:32:37] (03PS1) 10Dzahn: mgmt: add success/fail logs to changepw [puppet] - 10https://gerrit.wikimedia.org/r/320945 [01:37:05] incoming ... really quick gerrit restart for config change [01:37:16] (03PS6) 10Dzahn: Gerrit: Up the size for packedGitLimit to 2gb [puppet] - 10https://gerrit.wikimedia.org/r/317322 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [01:38:51] (03CR) 10Dzahn: [C: 032] Gerrit: Up the size for packedGitLimit to 2gb [puppet] - 10https://gerrit.wikimedia.org/r/317322 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [01:39:51] !log gerrit restarting for config change 317322 (T148478) [01:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:39:59] T148478: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478 [01:40:38] grrrit-wm: restart [01:40:40] re-connecting to gerrit [01:40:41] reconnected to gerrit [01:40:44] sweet [01:40:47] and done [01:41:15] hopefully that will help with performance of gerrit now [01:43:04] (03PS1) 10Madhuvishy: labstore: Check that NFS is being served over Cluster IP for secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/320946 (https://phabricator.wikimedia.org/T144633) [01:45:04] RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [01:45:06] !log gerrit now has higher "packedGitLimit" of 2g, goal is to reduce Gerrit 
slowdowns [01:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:52] (03CR) 10Dzahn: "done. gerrit restarted." [puppet] - 10https://gerrit.wikimedia.org/r/317322 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [01:49:35] (03CR) 10Madhuvishy: [C: 032] labstore: Check that NFS is being served over Cluster IP for secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/320946 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy) [01:49:53] (03CR) 10Madhuvishy: labstore: Check that NFS is being served over Cluster IP for secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/320946 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy) [01:50:33] (03PS2) 10Madhuvishy: labstore: Check that NFS is being served over Cluster IP for secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/320946 (https://phabricator.wikimedia.org/T144633) [01:51:06] madhuvishy: yeah that was me, renewed the downtime, thanks ! [01:51:19] godog: okay cool :) [01:52:03] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2788190 (10Dzahn) We have now increased the packedGitLimit setting to 2g. Like @20after4 originally said on [1] "2... 
[01:52:12] (03CR) 10Madhuvishy: [C: 032] labstore: Check that NFS is being served over Cluster IP for secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/320946 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy) [01:57:24] (03PS1) 10Madhuvishy: labstore: Rename secondary cluster monitoring descriptions [puppet] - 10https://gerrit.wikimedia.org/r/320949 [01:58:50] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1027 - https://phabricator.wikimedia.org/T150500#2788197 (10fgiunchedi) 05Open>03Invalid See also T140374 [01:58:51] (03CR) 10Madhuvishy: [C: 032] labstore: Rename secondary cluster monitoring descriptions [puppet] - 10https://gerrit.wikimedia.org/r/320949 (owner: 10Madhuvishy) [01:59:28] (03PS1) 10BBlack: test commit, 8k default buffer [debs/openssl11] - 10https://gerrit.wikimedia.org/r/320950 [01:59:31] (03PS1) 10BBlack: openssl (1.1.0c-1+wmf2) jessie-wikimedia; urgency=medium [debs/openssl11] - 10https://gerrit.wikimedia.org/r/320951 [02:00:38] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2788204 (10GWicke) The main benefit of encoding the original dimensions in the URL would be consistency across formats, and some amount of ease of use.... [02:03:40] (03CR) 10Dzahn: "hmm .. 
http://puppet-compiler.wmflabs.org/4584/" [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [02:08:16] (03CR) 10Dzahn: [C: 04-1] "I moved it to the ipmi module, but this doesn't install it globally as intended, this just installs it on puppetmaster, bast4001 and saltm" [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [02:18:08] (03PS5) 10Dzahn: base/ipmi: install freeipmi globally, move to ipmi module [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) [02:20:32] (03CR) 10Dzahn: "So it would have to be like PS5 then to work. Add a second class in module ipmi that just installs the packages and include that in base. " [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [02:23:11] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.2) (duration: 04m 56s) [02:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:58] (03CR) 10Dzahn: "the compiler says there would be no change but that's not true, bug T149432. 
if you look at the actual catalog the freeipmi packages get i" [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [02:28:24] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Nov 11 02:28:24 UTC 2016 (duration 5m 14s) [02:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:20] (03CR) 10Papaul: [C: 032] mgmt: add success/fail logs to changepw [puppet] - 10https://gerrit.wikimedia.org/r/320945 (owner: 10Dzahn) [02:46:52] (03CR) 10Papaul: "Tested and works" [puppet] - 10https://gerrit.wikimedia.org/r/320945 (owner: 10Dzahn) [02:50:59] (03PS2) 10Papaul: mgmt: add success/fail logs to changepw [puppet] - 10https://gerrit.wikimedia.org/r/320945 (owner: 10Dzahn) [03:05:00] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1811.361844 Seconds [03:06:00] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [03:06:40] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [03:07:40] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [03:18:40] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:20:00] PROBLEM - puppet last run on mc1036 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:25:23] (03PS1) 10Dzahn: fix mgmt names in wrong data center [dns] - 10https://gerrit.wikimedia.org/r/320954 [03:29:41] (03PS2) 10Dzahn: fix mgmt names in wrong data center [dns] - 10https://gerrit.wikimedia.org/r/320954 [03:35:35] (03PS3) 10Dzahn: fix mgmt names in wrong data center [dns] - 10https://gerrit.wikimedia.org/r/320954 [03:39:21] (03PS1) 10Dzahn: consistent capitalization of mgmt asset tag names [dns] - 10https://gerrit.wikimedia.org/r/320959 [03:47:17] (03PS2) 10Dzahn: consistent capitalization of mgmt asset tag names [dns] - 10https://gerrit.wikimedia.org/r/320959 [03:47:41] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [03:48:00] RECOVERY - puppet last run on mc1036 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [03:53:25] (03PS4) 10Dzahn: fix mgmt names in wrong data center [dns] - 10https://gerrit.wikimedia.org/r/320954 (https://phabricator.wikimedia.org/T149875) [04:00:53] (03CR) 10Dzahn: "Host wmf3138.mgmt.eqiad.wmnet. not found: 3(NXDOMAIN)" [dns] - 10https://gerrit.wikimedia.org/r/320954 (https://phabricator.wikimedia.org/T149875) (owner: 10Dzahn) [04:06:40] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [04:09:40] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [05:25:08] 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2788286 (10Naveenpf) @CRoslof This is an enhancement request. If someone take wikipedia.in now it is redirecting to new URL. There is no point in... [05:42:40] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:05:40] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [06:07:40] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [06:10:40] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:27:30] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:44:30] PROBLEM - MD RAID on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:47:30] RECOVERY - MD RAID on thumbor1001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [06:54:28] (03PS1) 10Madhuvishy: labstore: Set mailto address for secondary backups cron [puppet] - 10https://gerrit.wikimedia.org/r/320962 (https://phabricator.wikimedia.org/T144633) [06:55:50] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:56:30] PROBLEM - MD RAID on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:56:30] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:57:20] RECOVERY - MD RAID on thumbor1001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [07:17:06] 06Operations, 10ops-codfw, 10DBA: db2035: RAID disk about to fail - https://phabricator.wikimedia.org/T150511#2788294 (10Marostegui) [07:24:50] RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [07:26:08] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1068 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320963 [07:27:27] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1068 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320963 [07:30:22] (03PS2) 10Madhuvishy: labstore: Set mailto address for secondary backups cron [puppet] - 10https://gerrit.wikimedia.org/r/320962 (https://phabricator.wikimedia.org/T144633) [07:30:55] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1068 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320963 (owner: 10Marostegui) [07:31:26] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1068 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320963 (owner: 10Marostegui) [07:31:32] (03CR) 10Madhuvishy: [C: 032 V: 032] labstore: Set mailto address for secondary backups cron [puppet] - 10https://gerrit.wikimedia.org/r/320962 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy) [07:33:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1068 - T149079 (duration: 00m 48s) [07:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:17] T149079: codfw: Fix S4 commonswiki.templatelinks partitions - https://phabricator.wikimedia.org/T149079 [07:33:58] 06Operations: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788316 (10Peachey88) 
p:05Triage>03Unbreak! [07:34:10] someone want to look at https://phabricator.wikimedia.org/T150503 please? [07:34:24] legoktm: if you are still around^ [07:38:34] (03CR) 10Marostegui: mariadb-labs: Prepare db1095 to be the new sanitarium host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/320752 (https://phabricator.wikimedia.org/T149829) (owner: 10Jcrespo) [07:52:47] 06Operations, 13Patch-For-Review: Firewall sets not being loaded post-reboot due to a @resolve race - https://phabricator.wikimedia.org/T148986#2788321 (10MoritzMuehlenhoff) We already have monitoring for this (implicitly via the connection tracking Icinga check), but more explicit monitoring is under way via... [08:27:43] p858snake|L2: can you reproduce it? [08:28:59] 06Operations: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788064 (10Legoktm) Creating worked for me. Is this happening for anyone besides yourself? [08:44:59] 06Operations, 10MediaWiki-General-or-Unknown: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788354 (10Joe) [08:45:38] <_joe_> are we sure it's a UBN! ticket? [08:46:42] <_joe_> it seems like a thing that's important but not something we should work on non-stop with maximum priority [08:49:06] (03CR) 10Volans: [C: 04-1] "See inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [08:53:33] _joe_: it only affects a few hosts and was quickly spotted via the failing conntrack check, but we can just as well keep the prio, I'm working on the dedicated Icinga check later in the day [08:53:57] <_joe_> moritzm: what are you referring to? 
[08:54:08] <_joe_> I was referring to T150503 [08:54:08] T150503: Failure to save recent changes - https://phabricator.wikimedia.org/T150503 [08:54:46] oh, sorry, I thought you were referring to T148986, which I commented on a few lines above [08:54:46] T148986: Firewall sets not being loaded post-reboot due to a @resolve race - https://phabricator.wikimedia.org/T148986 [08:56:41] _joe_: tbh, if people can't save edits, yes it's UBN [08:56:47] legoktm: bit busy to check now [08:57:05] <_joe_> p858snake|L2: I agree, but it's a single report AFAICS [08:57:42] <_joe_> from a few hours ago, if there are more, I agree with you [08:58:08] <_joe_> if not, it can be treated within the flow of "high" priority tickets, IMHO [08:58:48] <_joe_> that's why I asked for opinions :) [09:02:40] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:21:21] (03PS2) 10Muehlenhoff: Configure connection tracking sysctl settings in ferm [puppet] - 10https://gerrit.wikimedia.org/r/320590 (https://phabricator.wikimedia.org/T136094) [09:21:32] (03CR) 10Muehlenhoff: [C: 04-2] Configure connection tracking sysctl settings in ferm [puppet] - 10https://gerrit.wikimedia.org/r/320590 (https://phabricator.wikimedia.org/T136094) (owner: 10Muehlenhoff) [09:26:17] (03CR) 10Muehlenhoff: "I tested the approach of setting the sysctl settings in a ferm configuration sub file in https://gerrit.wikimedia.org/r/#/c/320590/, but t" [puppet] - 10https://gerrit.wikimedia.org/r/319071 (owner: 10Muehlenhoff) [09:30:40] RECOVERY - puppet last run on mw1252 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [09:33:29] (03PS3) 10Elukey: Disable connection tracking for kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/320758 (owner: 10Muehlenhoff) [09:37:51] (03CR) 10Elukey: [C: 032] Disable connection tracking for kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/320758 (owner: 10Muehlenhoff) 
[09:38:50] disabled puppet on kafka analytics, will run puppet only on one broker first for --^ [09:38:57] !log Deploy schema change s4 commonswiki.revision db1069 - T147305 [09:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:07] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305 [09:50:26] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2788425 (10Marostegui) The data copy finished and after running mysql_upgrade I have started replication and the slaves are catching up nicely with the master. I forgot to include the RAID config... [09:51:21] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2788426 (10Marostegui) @Papaul the disks still need to be wiped, is that something you can do or something we have to do? I will leave this ticket open until you let us know. Thanks [09:54:17] (03PS6) 10Muehlenhoff: Create a separate sysctl configuration for setting conntrack settings [puppet] - 10https://gerrit.wikimedia.org/r/319071 [10:03:31] (03PS7) 10Muehlenhoff: Create a separate sysctl configuration for setting conntrack settings [puppet] - 10https://gerrit.wikimedia.org/r/319071 [10:04:34] (03PS3) 10Muehlenhoff: Load connection tracking sysctl values via a separate systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) [10:05:04] (03CR) 10jenkins-bot: [V: 04-1] Create a separate sysctl configuration for setting conntrack settings [puppet] - 10https://gerrit.wikimedia.org/r/319071 (owner: 10Muehlenhoff) [10:05:19] !log increasing apache log level on mw1284 (depooling, applying config manually, re-pooling with lower weight) for a 503 investigation [10:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:49] (03CR) 10jenkins-bot: [V: 04-1] Load connection tracking sysctl values via a 
separate systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) (owner: 10Muehlenhoff) [10:05:50] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [10:06:50] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [10:14:55] !log Deploy alter table dbstore1002 s4 commonswiki.revision - T147305 [10:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:02] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305 [10:19:24] (03PS8) 10Muehlenhoff: Create a separate sysctl configuration for setting conntrack settings [puppet] - 10https://gerrit.wikimedia.org/r/319071 [10:20:19] (03CR) 10Alexandros Kosiaris: "Answered all inline comments, @volans, I also did some basic state mapping as you suggested." (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/320793 (https://phabricator.wikimedia.org/T134890) (owner: 10Alexandros Kosiaris) [10:20:46] (03PS2) 10Alexandros Kosiaris: Introduce a system wide systemd check [puppet] - 10https://gerrit.wikimedia.org/r/320793 (https://phabricator.wikimedia.org/T134890) [10:22:15] (03PS4) 10Muehlenhoff: Load connection tracking sysctl values via a separate systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) [10:32:06] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2788464 (10ArielGlenn) This setting change means that we'll have more things in memory and that (logically) GC pause... [10:35:10] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 17 failures. Last run 2 minutes ago with 17 failures. 
Failed resources (up to 3 shown): Service[ferm],Service[diamond],Service[prometheus-node-exporter],Package[ecryptfs-utils] [10:38:00] 06Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788466 (10ema) [10:40:22] 06Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788473 (10ema) Log captured with `varnishlog -n frontend -g request -q 'RespStatus eq 503'` ``` * << Request >> 629660955 - Begin req 629660954 rxreq - Timest... [10:41:56] 06Operations, 10Traffic, 05Prometheus-metrics-monitoring: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#2788475 (10ema) p:05Triage>03Normal [10:42:59] 06Operations, 10Traffic: 503 errors for users connecting to esams - https://phabricator.wikimedia.org/T149865#2788477 (10ema) 05Open>03Resolved [10:51:50] !log restored mw1284 to its normal settings [10:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:01] (03PS5) 10Muehlenhoff: Load connection tracking sysctl values via a separate systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) [10:54:17] (03CR) 10jenkins-bot: [V: 04-1] Load connection tracking sysctl values via a separate systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) (owner: 10Muehlenhoff) [10:56:53] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2788495 (10Paladox) @ArielGlenn so should we revert? We should try CMS? 
[10:58:46] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2788499 (10ArielGlenn) Just leave it for now. If the logs show a sharp enough increase in pause times, I'll report... [10:59:43] !log cp3043 depooled, testing https://phabricator.wikimedia.org/P4406 (T150503) [10:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:48] T150503: Failure to save recent changes - https://phabricator.wikimedia.org/T150503 [11:07:10] PROBLEM - Varnish HTTP text-backend - port 3128 on cp3043 is CRITICAL: connect to address 10.20.0.178 and port 3128: Connection refused [11:07:35] that's me, should be fixed soon ^ [11:08:10] RECOVERY - Varnish HTTP text-backend - port 3128 on cp3043 is OK: HTTP OK: HTTP/1.1 200 OK - 177 bytes in 0.168 second response time [11:10:15] !log cp3043 repooled with gethdr_extrachance=100 (T150503) [11:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:21] T150503: Failure to save recent changes - https://phabricator.wikimedia.org/T150503 [11:10:38] (03PS1) 10Alexandros Kosiaris: grafana: Provision the Server Board dashboard as JSON [puppet] - 10https://gerrit.wikimedia.org/r/320972 [11:23:20] PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:24:23] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2724169 (10MoritzMuehlenhoff) Now we have gerrit running on Debian we also have the option to use openjdk-8 instead... [11:31:10] RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [11:34:20] PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:35:11] RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [11:45:50] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [11:48:50] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [12:11:53] (03PS2) 10Muehlenhoff: Check whether ferm has been correctly started [puppet] - 10https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) [12:13:05] (03CR) 10jenkins-bot: [V: 04-1] Check whether ferm has been correctly started [puppet] - 10https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) (owner: 10Muehlenhoff) [12:14:32] (03PS3) 10Muehlenhoff: Check whether ferm has been correctly started [puppet] - 10https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) [12:46:28] moritzm: where is sudo being called? [12:46:48] 06Operations, 13Patch-For-Review: Cleanup debconf handling in mailman puppet setup - https://phabricator.wikimedia.org/T144933#2788706 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff>03None [12:48:01] 06Operations, 06Labs, 13Patch-For-Review: 4.4-series kernel vs. iptables - https://phabricator.wikimedia.org/T142388#2788708 (10MoritzMuehlenhoff) 05Open>03Resolved This has been fixed, all labvirt systems are running Linux 4.4 for a while now. [12:48:24] (03PS1) 10BBlack: VCL: fixups for synthetic error status [puppet] - 10https://gerrit.wikimedia.org/r/320975 [12:49:22] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2788712 (10ArielGlenn) >>! In T148478#2788533, @MoritzMuehlenhoff wrote: > Now we have gerrit running on Debian we a... 
[12:52:57] paravoid: oops, fixed [12:53:13] (03PS4) 10Muehlenhoff: Check whether ferm has been correctly started [puppet] - 10https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) [12:53:57] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2788735 (10Paladox) I could do this on the test instance I am using, but it may not work with gerrit 2.12 but may wi... [12:54:30] (03CR) 10jenkins-bot: [V: 04-1] Check whether ferm has been correctly started [puppet] - 10https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) (owner: 10Muehlenhoff) [13:00:01] (03PS5) 10Muehlenhoff: Check whether ferm has been correctly started [puppet] - 10https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) [13:01:09] (03CR) 10jenkins-bot: [V: 04-1] Check whether ferm has been correctly started [puppet] - 10https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) (owner: 10Muehlenhoff) [13:05:02] (03PS6) 10Muehlenhoff: Check whether ferm has been correctly started [puppet] - 10https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) [13:05:22] (03PS2) 10BBlack: VCL: fixups for synthetic error status [puppet] - 10https://gerrit.wikimedia.org/r/320975 [13:06:25] (03CR) 10jenkins-bot: [V: 04-1] Check whether ferm has been correctly started [puppet] - 10https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) (owner: 10Muehlenhoff) [13:08:06] (03PS7) 10Muehlenhoff: Check whether ferm has been correctly started [puppet] - 10https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) [13:11:05] !log installing curl security updates [13:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:50] PROBLEM - puppet last run on elastic1026 is CRITICAL: 
CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:12:14] 06Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788776 (10elukey) From the httpd point of view: There are a lot of 503s logged for GET requests for /w/api.php like the following: ``` 2016-11-11T12:07:44 59999926 10.64.0.1... [13:15:25] 06Operations: Upgrade qemu on ganeti clusters to 2.7 - https://phabricator.wikimedia.org/T150532#2788778 (10MoritzMuehlenhoff) [13:31:46] (03CR) 10Faidon Liambotis: "LGTM ­— the dependencies (requires) are probably excessive/not very useful (the sudo user doesn't really require the file, and the nrpe de" [puppet] - 10https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) (owner: 10Muehlenhoff) [13:32:02] (03CR) 10Faidon Liambotis: [C: 031] Check whether ferm has been correctly started [puppet] - 10https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) (owner: 10Muehlenhoff) [13:34:13] (03CR) 10Faidon Liambotis: [C: 04-1] "See inline for a syntax error. I also still hate the _traditional part. 
Long-lived certificates are still the norm, and I think having a s" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293) (owner: 10Alex Monk) [13:35:11] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 641 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3050721 keys, up 11 days 5 hours - replication_delay is 641 [13:36:10] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3031423 keys, up 11 days 5 hours - replication_delay is 0 [13:40:38] (03PS1) 10DCausse: [WIP] test job jenkins with mw-core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320980 (https://phabricator.wikimedia.org/T143932) [13:40:50] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [13:41:17] 06Operations: Upgrade qemu on ganeti clusters to 2.7 - https://phabricator.wikimedia.org/T150532#2788834 (10akosiaris) From a quick look into the Changelogs, 2.7 has nothing backwards incompatible that should worry us, 2.6 does however. Specifically `The aio=native option to "-drive" now requires the cache=none... [13:41:26] (03CR) 10jenkins-bot: [V: 04-1] [WIP] test job jenkins with mw-core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320980 (https://phabricator.wikimedia.org/T143932) (owner: 10DCausse) [13:42:28] (03PS3) 10Giuseppe Lavagetto: RESTBase config: Use special project for wikidata domains. 
[puppet] - 10https://gerrit.wikimedia.org/r/320529 (owner: 10Ppchelko) [13:43:26] (03PS2) 10DCausse: [WIP] test job jenkins with mw-core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320980 (https://phabricator.wikimedia.org/T115713) [13:44:10] (03CR) 10jenkins-bot: [V: 04-1] [WIP] test job jenkins with mw-core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320980 (https://phabricator.wikimedia.org/T115713) (owner: 10DCausse) [13:46:43] (03CR) 10Giuseppe Lavagetto: [C: 032] RESTBase config: Use special project for wikidata domains. [puppet] - 10https://gerrit.wikimedia.org/r/320529 (owner: 10Ppchelko) [13:51:10] (03CR) 10DCausse: [C: 04-1] "test patch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320980 (https://phabricator.wikimedia.org/T115713) (owner: 10DCausse) [14:02:24] !log restarting hhvm on canary app servers to pick up libcurl update [14:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:46] !log restarting RESTBase to pick up https://gerrit.wikimedia.org/r/#/c/320529/ [14:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:00] PROBLEM - Juniper alarms on mr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.199 [14:09:51] RECOVERY - Juniper alarms on mr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [14:17:10] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:18:10] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [14:29:53] 06Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788887 (10ema) We've been able to reproduce the bug on pinkunicorn by closing the connection before sending Content-Length bytes as follows: ``` #!/usr/bin/env python import... 
[14:48:15] (03PS1) 10Alexandros Kosiaris: profile::docker::builder: Conditionalize hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/320985 [14:59:35] (03PS1) 10Muehlenhoff: Update to 4.4.31 [debs/linux44] - 10https://gerrit.wikimedia.org/r/320986 [15:06:45] 06Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788905 (10elukey) Even simpler: ``` curl -d "Hola!" --header "Content-Length: 120" --header "Host: en.wikipedia.org" localhost/w/api.php ``` I checked the httpd trunk code an... [15:22:15] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:23:15] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [15:25:52] 06Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788064 (10Joe) So to be a bit more precise on what happens on apache: `mod_proxy_fcgi` reads the request body in a loop, when it gets to the end of input according to the cont... [15:26:52] (03PS2) 10Muehlenhoff: Update to 4.4.31 [debs/linux44] - 10https://gerrit.wikimedia.org/r/320986 [15:28:15] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2701526 (10akosiaris) There is one issue I 'd like to (re?)touch on. Whether explicit hiera() lookups in profiles should have defaults or not (I am assu... 
[15:32:13] (03PS1) 10Marostegui: mariadb: Split backup class into a different file [puppet] - 10https://gerrit.wikimedia.org/r/320989 [15:37:52] (03CR) 10Marostegui: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/4586/" [puppet] - 10https://gerrit.wikimedia.org/r/320989 (owner: 10Marostegui) [15:40:50] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.31 [debs/linux44] - 10https://gerrit.wikimedia.org/r/320986 (owner: 10Muehlenhoff) [15:44:25] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2788931 (10Dzahn) Since the original now asks for a login, here's the Google cache version to why this was done: ht... [15:49:28] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2788933 (10Andrew) A few things about http://garylarizza.com/blog/2014/02/17/puppet-workflow-part-2/: 1) That argument is premised on a given user hav... 
[15:57:00] (03Abandoned) 10Muehlenhoff: Configure connection tracking sysctl settings in ferm [puppet] - 10https://gerrit.wikimedia.org/r/320590 (https://phabricator.wikimedia.org/T136094) (owner: 10Muehlenhoff) [15:57:56] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#2788943 (10MoritzMuehlenhoff) [16:04:19] (03PS1) 10Gehel: Imported Upstream version 1.11.0 [debs/logstash-gelf] - 10https://gerrit.wikimedia.org/r/320991 [16:04:21] (03PS1) 10Gehel: New upstream version: 1.11.0 [debs/logstash-gelf] - 10https://gerrit.wikimedia.org/r/320992 (https://phabricator.wikimedia.org/T150408) [16:04:28] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2788948 (10akosiaris) >>! In T147718#2788933, @Andrew wrote: > A few things about http://garylarizza.com/blog/2014/02/17/puppet-workflow-part-2/: > > 1... 
[16:07:37] (03CR) 10Muehlenhoff: New upstream version: 1.11.0 (031 comment) [debs/logstash-gelf] - 10https://gerrit.wikimedia.org/r/320992 (https://phabricator.wikimedia.org/T150408) (owner: 10Gehel) [16:10:46] (03PS2) 10Gehel: New upstream version: 1.11.0 [debs/logstash-gelf] - 10https://gerrit.wikimedia.org/r/320992 (https://phabricator.wikimedia.org/T150408) [16:20:25] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [16:21:25] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3038251 keys, up 11 days 7 hours - replication_delay is 0 [16:21:29] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2788982 (10Andrew) I really don't know how to engage when you assert that you are unable to understand how implicit lookups work. They're unfamiliar an... [16:21:51] (03PS1) 10Bmansurov: MF Beta: Enable moving first paragraph before infobox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320993 (https://phabricator.wikimedia.org/T149830) [16:27:04] 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2788993 (10Aklapper) Outsider comment: The task summary currently says "Point wikipedia.in to 205.147.101.160 instead of URL forward". If I curr... 
[16:28:58] (03CR) 10Muehlenhoff: [C: 031] "I haven't reviewed the patches (and whether they are still needed with the new upstream release) but looks fine in general" [debs/logstash-gelf] - 10https://gerrit.wikimedia.org/r/320992 (https://phabricator.wikimedia.org/T150408) (owner: 10Gehel) [16:34:00] (03PS1) 10Rush: tools nfsclient: cleanup [puppet] - 10https://gerrit.wikimedia.org/r/320995 [16:36:22] (03CR) 10Rush: [C: 032] tools nfsclient: cleanup [puppet] - 10https://gerrit.wikimedia.org/r/320995 (owner: 10Rush) [16:37:05] 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2789004 (10Naveenpf) Hi Aklapper, We are having multiple websites in same server. We are doing the same for all other Indic websites. [root@e2... [16:38:53] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2789017 (10Marostegui) [16:44:05] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:51:39] (03PS1) 10Rush: nfsclient: fix dependency issue with scratch [puppet] - 10https://gerrit.wikimedia.org/r/320999 [16:51:47] (03PS1) 10Ema: Revert "tlsproxy: turn proxy_request_buffering off for v4" [puppet] - 10https://gerrit.wikimedia.org/r/321000 (https://phabricator.wikimedia.org/T150503) [16:53:56] (03CR) 10Rush: [C: 032 V: 032] nfsclient: fix dependency issue with scratch [puppet] - 10https://gerrit.wikimedia.org/r/320999 (owner: 10Rush) [16:54:58] (03Abandoned) 10Rush: WIP: candidate idea for secondary backups [puppet] - 10https://gerrit.wikimedia.org/r/319365 (owner: 10Rush) [16:55:12] (03PS2) 10Ema: Revert "tlsproxy: turn proxy_request_buffering off for v4" [puppet] - 10https://gerrit.wikimedia.org/r/321000 (https://phabricator.wikimedia.org/T150503) [16:55:20] (03CR) 10Ema: [C: 032 V: 032] Revert "tlsproxy: turn proxy_request_buffering off for v4" [puppet] - 10https://gerrit.wikimedia.org/r/321000 (https://phabricator.wikimedia.org/T150503) (owner: 10Ema) [17:00:35] (03PS1) 10Madhuvishy: labstore: Dual mount tools from labstore1001 and labstore-secondary [puppet] - 10https://gerrit.wikimedia.org/r/321001 (https://phabricator.wikimedia.org/T146154) [17:07:15] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:09:32] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154#2789074 (10madhuvishy) [17:09:55] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [17:10:05] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154#2652289 (10madhuvishy) [17:10:55] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [17:12:05] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:13:41] (03CR) 10Rush: labstore: Dual mount tools from labstore1001 and labstore-secondary (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/321001 (https://phabricator.wikimedia.org/T146154) (owner: 10Madhuvishy) [17:13:45] (03CR) 10Filippo Giunchedi: [C: 032] grafana: Provision the Server Board dashboard as JSON [puppet] - 10https://gerrit.wikimedia.org/r/320972 (owner: 10Alexandros Kosiaris) [17:13:54] (03PS2) 10Filippo Giunchedi: grafana: Provision the Server Board dashboard as JSON [puppet] - 10https://gerrit.wikimedia.org/r/320972 (owner: 10Alexandros Kosiaris) [17:15:33] (03CR) 10Madhuvishy: labstore: Dual mount tools from labstore1001 and labstore-secondary (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/321001 (https://phabricator.wikimedia.org/T146154) (owner: 10Madhuvishy) [17:16:45] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154#2789082 (10madhuvishy) [17:22:53] 06Operations, 10MediaWiki-General-or-Unknown, 10Traffic: 
Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2789087 (10ema) We've set nginx's proxy_request_buffering back on: https://gerrit.wikimedia.org/r/#/c/321000/ and that seems to help. [17:23:55] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/grafana/dashboards/server-board.json] [17:31:19] (03CR) 10Rush: labstore: Dual mount tools from labstore1001 and labstore-secondary (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/321001 (https://phabricator.wikimedia.org/T146154) (owner: 10Madhuvishy) [17:31:55] (03PS3) 10Rush: labs: add ores_classification and ores_model tables [puppet] - 10https://gerrit.wikimedia.org/r/320804 (https://phabricator.wikimedia.org/T148561) (owner: 10Ladsgroup) [17:33:55] PROBLEM - puppet last run on rcs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:36:15] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:37:37] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154#2789124 (10chasemp) [17:39:02] (03PS2) 10Madhuvishy: labstore: Dual mount tools from labstore1001 and labstore-secondary [puppet] - 10https://gerrit.wikimedia.org/r/321001 (https://phabricator.wikimedia.org/T146154) [17:55:15] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 57 failures. Last run 2 minutes ago with 57 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service] [17:56:32] (03PS1) 10Filippo Giunchedi: fixup for I13b135e4 [puppet] - 10https://gerrit.wikimedia.org/r/321012 [17:57:02] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] fixup for I13b135e4 [puppet] - 10https://gerrit.wikimedia.org/r/321012 (owner: 10Filippo Giunchedi) [18:01:55] RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [18:04:55] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [18:05:05] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:06:33] 06Operations: Evaluate use of systemd-timesyncd on jessie for clock synchronisation - https://phabricator.wikimedia.org/T150257#2789160 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [18:08:55] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [18:12:14] (03PS14) 10Mobrovac: PDF Render Service: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) [18:16:48] (03PS15) 10Mobrovac: PDF Render Service: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) [18:17:25] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:18:25] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [18:20:25] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [18:21:25] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3043153 keys, up 11 days 9 hours - replication_delay is 0 [18:23:01] (03PS1) 10Yuvipanda: Add libenchant to python(2)? base images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/321013 (https://phabricator.wikimedia.org/T143449) [18:25:37] (03PS5) 10Alex Monk: Split check_ssl between traditional year-long certs and LE's 3 month certs [puppet] - 10https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293) [18:27:04] (03PS6) 10Alex Monk: Split check_ssl between traditional year-long certs and LE's 3 month certs [puppet] - 10https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293) [18:29:38] (03CR) 10Madhuvishy: [C: 032] Add libenchant to python(2)? base images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/321013 (https://phabricator.wikimedia.org/T143449) (owner: 10Yuvipanda) [18:30:12] (03Merged) 10jenkins-bot: Add libenchant to python(2)? 
base images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/321013 (https://phabricator.wikimedia.org/T143449) (owner: 10Yuvipanda) [18:34:32] (03PS16) 10Mobrovac: PDF Render Service: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) [18:52:28] (03PS17) 10Mobrovac: PDF Render Service: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) [19:02:15] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [19:04:55] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [19:08:35] (03CR) 10Faidon Liambotis: [C: 04-1] Split check_ssl between traditional year-long certs and LE's 3 month certs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293) (owner: 10Alex Monk) [19:09:55] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [19:10:14] ^^ that's not me restarting grrrit-wm [19:10:23] i haven't restarted it today [19:19:15] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:20:33] (03CR) 10Rush: [C: 031] labstore: Dual mount tools from labstore1001 and labstore-secondary [puppet] - 10https://gerrit.wikimedia.org/r/321001 (https://phabricator.wikimedia.org/T146154) (owner: 10Madhuvishy) [19:42:26] Hey guys? [19:43:02] Just wondering… is it one of you peeps that’s poking broken transcodes back through the queue on Commons? [19:44:25] Those hour+ HD files won’t successfully get through unless they are run one-per-sever at a time… they time out after 6 hours or so, if run several at a time. 
[19:44:31] *server [19:45:31] Someone put 4x transcodes of a 2.57GB file on there, at once… it will not work. [19:48:15] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [19:51:06] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:51:55] PROBLEM - puppet last run on mc1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:54:13] (03PS3) 10Madhuvishy: labstore: Dual mount tools from labstore1001 and labstore-secondary [puppet] - 10https://gerrit.wikimedia.org/r/321001 (https://phabricator.wikimedia.org/T146154) [19:54:20] (03CR) 10Madhuvishy: [C: 032 V: 032] labstore: Dual mount tools from labstore1001 and labstore-secondary [puppet] - 10https://gerrit.wikimedia.org/r/321001 (https://phabricator.wikimedia.org/T146154) (owner: 10Madhuvishy) [19:55:05] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [20:04:05] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [20:07:14] (03PS1) 10Madhuvishy: labstore: Fix service urls for secondary nfs cluster [puppet] - 10https://gerrit.wikimedia.org/r/321017 [20:08:05] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [20:08:55] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [20:08:56] (03CR) 10Madhuvishy: [C: 032] labstore: Fix service urls for secondary nfs cluster [puppet] - 10https://gerrit.wikimedia.org/r/321017 (owner: 10Madhuvishy) [20:10:55] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [20:15:45] Hey everyone, I have a question about 
beta cluster configuration - anyone here to help ? [20:16:49] Question - PHP reads host name from config - key `Server` [20:17:20] I just want to check what's under that key for `wikipedia.beta.wmflabs.org` [20:17:23] raynor #wikimedia-releng [20:17:44] thx Zppix [20:17:54] no problem [20:19:55] RECOVERY - puppet last run on mc1008 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [20:23:05] PROBLEM - puppet last run on ms-be1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:19] Reedy, or greg-g around? [20:24:35] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=300.60 Read Requests/Sec=3115.60 Write Requests/Sec=5.50 KBytes Read/Sec=20630.80 KBytes_Written/Sec=2303.60 [20:26:12] anyone? [20:27:00] grrrit-wm: restart [20:27:08] re-connecting to gerrit [20:27:09] reconnected to gerrit [20:27:17] grrrit-wm: force-restart [20:27:19] re-connecting to gerrit and irc. [20:27:55] grrrit-wm: nick [20:28:00] re-connected to gerrit and irc. [20:28:14] grrrit-wm: nick [20:28:19] Nick is already grrrit-wm not changing the nick. [20:28:20] Nick is already grrrit-wm not changing the nick. [20:28:25] grrrit-wm: help [20:28:27] My current commands are: grrrit-wm: restart, grrrit-wm: force-restart, and grrrit-wm: nick [20:28:37] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#2789330 (10matmarex) It appears that the vast majority of... [20:33:27] I am so glad I don't stalk my username on IRC. 
[20:34:17] lol [20:36:35] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=185.60 Read Requests/Sec=164.70 Write Requests/Sec=2.40 KBytes Read/Sec=3716.40 KBytes_Written/Sec=370.00 [20:37:25] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:37:30] bastion-3's ip is blocked on enwiki atm a bot got logged out or something and was editing as bastion-3 [20:37:33] just fyi [20:40:15] grrrit-wm: restart [20:40:22] re-connecting to gerrit [20:40:25] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [20:50:08] (03PS6) 10Filippo Giunchedi: Initial commit [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/319477 (https://phabricator.wikimedia.org/T147423) [20:51:05] RECOVERY - puppet last run on ms-be1009 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [20:52:13] (03CR) 10Filippo Giunchedi: "Minimal test scaffolding added" [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/319477 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [20:53:50] 06Operations, 05Prometheus-metrics-monitoring: Deploy federation for Prometheus - https://phabricator.wikimedia.org/T150486#2789344 (10fgiunchedi) [20:54:31] jouncebot now [20:54:32] No deployments scheduled for the next 65 hour(s) and 5 minute(s) [20:54:38] (03PS1) 10Madhuvishy: exec-manage: Change order of params to support xargs for node names [puppet] - 10https://gerrit.wikimedia.org/r/321022 [20:56:13] (03CR) 10Madhuvishy: [C: 032] exec-manage: Change order of params to support xargs for node names [puppet] - 10https://gerrit.wikimedia.org/r/321022 (owner: 10Madhuvishy) [20:58:06] (03PS1) 10Hashar: jenkins: disable cli [puppet] - 10https://gerrit.wikimedia.org/r/321023 [20:59:54] (03PS2) 10ArielGlenn: jenkins: disable cli [puppet] - 10https://gerrit.wikimedia.org/r/321023 (owner: 10Hashar) [21:02:29] (03CR) 10ArielGlenn: [C: 032] jenkins: 
disable cli [puppet] - 10https://gerrit.wikimedia.org/r/321023 (owner: 10Hashar) [21:06:14] !log Restarted Jenkins [21:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:55] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [21:08:15] PROBLEM - jenkins_zmq_publisher on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused [21:08:32] checking [21:08:37] thank you [21:08:54] didn't have that issue earlier [21:08:55] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [21:11:15] RECOVERY - jenkins_zmq_publisher on contint1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 8888 [21:11:57] !log jenkins: disabled/reenabled the ZMQ Event Publisher. Apparently it refused to start [21:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:31] silly thing [21:12:51] I am restarting it again to confirm [21:15:40] apergos: that was a one time error [21:15:48] great [21:15:58] !log Restarted Jenkins. This time ZMQ managed to bind to port 8888 [21:16:01] sounds good to me [21:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:10] PROBLEM - MariaDB disk space on labsdb1004 is CRITICAL: DISK CRITICAL - free space: /srv/labsdb 122775 MB (5% inode=99%) [21:26:41] wah wah, I'll take a look [21:27:09] godog: ping me if you need a hand ;) [21:27:17] grrrit-wm: restart [21:27:39] godog: volans thanks guys :) I think that's the toolsdb slave? I can't recall [21:27:46] grrrit-wm: restart [21:28:50] volans: thanks! 
[21:29:06] yeah chasemp I think it is up for reimporting for the jessie migration, https://phabricator.wikimedia.org/T123731 [21:29:34] or not, the sal mentioned only a reimport [21:31:13] looks like to me the space free has been steadily declined, though there's still space free on the VG [21:32:21] godog: I would say increase the volume [21:34:07] marostegui: I agree, looks like it started yesterday to go down significantly though [21:34:24] marostegui: look if there is any offender that is loading a ton of data [21:34:27] it happened in the past [21:34:58] that some maintenance or other changes on some tools used a lot of data [21:35:55] also what was the magic incantation for the mysql client? I'm getting SSL certificate validation failure [21:36:01] godog: you gave it 100G right? [21:36:10] RECOVERY - MariaDB disk space on labsdb1004 is OK: DISK OK [21:36:18] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2789393 (10Joe) >>! In T147718#2788982, @Andrew wrote: > Am I really the only one out here in favor of simply using the language /as it is designed/?... [21:36:19] godog: --skip-ssl but for labsdb it is disable, you need to go thru neodymium in this case [21:36:22] marostegui: no I didn't touch it [21:36:27] maybe a quick cache clear on the server that was using alot of disk space could help? [21:36:32] volans: nothing is using it now [21:36:49] akosiaris perhaps expanded it [21:36:49] godog: Interesting the size of the vg went from 2T to 2.1T [21:36:53] ah [21:36:55] ok [21:37:07] marostegui: I've put a watch with du and saw the data directory growing (ofc I would say), I'm looking inside [21:37:57] marostegui: yeah I noticed also no .my.cnf, thanks anyways!
[21:39:54] there are looots of binlogs for today and yesterday [21:39:56] more than usual [21:40:07] yes every few minutes [21:40:26] 100MB, 3.4MB, 339 bytes [21:41:25] https://phabricator.wikimedia.org/P4411 [21:41:51] anything on processlist? [21:41:57] nope [21:42:10] looks like it isn't polled by prometheus ;_; [21:43:28] the last binlog has stopped growing so much [21:43:33] so whatever it is stopped [21:43:56] marostegui: did you changed it's role in the last days? look bytes in/out on tendril [21:44:01] since yesterday changed a lot [21:44:06] no [21:46:28] looks like the same increase in bytes is also on its master labsdb1005 [21:47:13] so I guess someone importing stuff? as I said, the binlog has now stopped [21:47:47] marostegui: from my diffs no single DB has grown in this short amount of time, looks like the space was all from relay logs, that make me think of some update/replace activity [21:48:27] oh, interesting if a db didn't increase, it was probably the relay logs then yes [21:50:09] ufff... mysqlbinlog: unknown variable 'default-character-set=utf8mb4' [21:50:37] volans: alias mysqlbinlog='/opt/wmf-mariadb10/bin/mysqlbinlog --defaults-file=/root/.my.cnf' [21:50:52] there is no root/.my.cnf there ;) [21:50:59] ah true :( [21:51:27] Since this is no longer an issue I am going to logoff (i am in a restaurant XD) I will take a look if needed tomorrow! [21:51:28] but /dev/null works ;) [21:51:31] thanks guys for the help [21:51:48] np, bye marostegui ! [22:05:55] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [22:07:55] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [22:17:05] PROBLEM - MD RAID on ms-be2025 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:17:15] PROBLEM - Docker registry HTTP interface on darmstadtium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:17:55] PROBLEM - very high load average likely xfs on ms-be2025 is CRITICAL: CRITICAL - load average: 249.72, 133.53, 66.40 [22:19:38] poop [22:21:42] godog: so the RAID check is actually a load check :-P [22:22:05] RECOVERY - Docker registry HTTP interface on darmstadtium is OK: HTTP OK: HTTP/1.1 200 OK - 2460 bytes in 0.731 second response time [22:22:15] PROBLEM - Disk space on ms-be2025 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sda3 is not accessible: Input/output error [22:23:52] heheh sort of [22:23:56] PROBLEM - swift-container-replicator on ms-be2025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [22:23:56] PROBLEM - swift-object-replicator on ms-be2025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [22:23:56] PROBLEM - swift-account-replicator on ms-be2025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [22:23:56] PROBLEM - swift-container-auditor on ms-be2025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:23:56] PROBLEM - swift-container-updater on ms-be2025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [22:24:09] shush [22:24:35] PROBLEM - swift-object-auditor on ms-be2025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [22:28:55] RECOVERY - very high load average likely xfs on ms-be2025 is OK: OK - load average: 1.46, 0.31, 0.10 [22:28:55] RECOVERY - swift-container-replicator on ms-be2025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [22:28:55] RECOVERY - swift-container-auditor on ms-be2025 is OK: PROCS OK: 1 
process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:28:55] RECOVERY - MD RAID on ms-be2025 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [22:29:05] RECOVERY - swift-container-updater on ms-be2025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [22:29:05] RECOVERY - swift-account-replicator on ms-be2025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [22:29:05] RECOVERY - swift-object-replicator on ms-be2025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [22:29:15] RECOVERY - Disk space on ms-be2025 is OK: DISK OK [22:29:35] RECOVERY - swift-object-auditor on ms-be2025 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [22:37:58] (03PS1) 10Filippo Giunchedi: admin: revoke jforrester ssh key [puppet] - 10https://gerrit.wikimedia.org/r/321081 [22:39:35] (03CR) 10Filippo Giunchedi: [C: 032] admin: revoke jforrester ssh key [puppet] - 10https://gerrit.wikimedia.org/r/321081 (owner: 10Filippo Giunchedi) [22:39:45] (03PS2) 10Filippo Giunchedi: admin: revoke jforrester ssh key [puppet] - 10https://gerrit.wikimedia.org/r/321081 [22:40:33] whats going on with jforrester? [22:40:59] (03CR) 10Filippo Giunchedi: [V: 032] admin: revoke jforrester ssh key [puppet] - 10https://gerrit.wikimedia.org/r/321081 (owner: 10Filippo Giunchedi) [22:42:33] Zppix: not sure yet [22:43:16] godog should we riot xD (resuming normal development and stuff i was doing) [22:43:36] haha no no worries Zppix [22:59:31] 06Operations, 06Labs, 10wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2610832 (10Shoichi) Months ago, I can log in,but it also happen to me. I don't know rember I had set two-factor authentication or not. Can someone help me? 
>_< My account there is also "shoichi" [23:00:50] grrrit-wm: nick [23:00:51] Nick is already grrrit-wm not changing the nick. [23:00:59] grrrit-wm: restart [23:01:25] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get [23:01:35] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get [23:01:35] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get [23:02:25] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get [23:02:25] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get [23:02:25] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get [23:02:25] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get [23:02:25] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed 
body: NoneType object has no attribute get [23:02:26] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get [23:02:26] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get [23:02:27] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get [23:02:27] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get [23:02:28] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get [23:02:28] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get [23:02:29] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get [23:02:29] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get [23:02:30] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: 
/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get [23:03:41] -.- [23:04:03] I'll take a look [23:04:15] thanks, I'm about ready to check out (1am) [23:04:56] hehe enjoy the weekend apergos [23:05:05] thanks, have one yourself soon [23:05:55] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [23:10:55] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [23:11:55] bah there was a restbase deployment earlier today but I'm not sure why the check failed only now [23:13:29] godog: there was no deployment today, only yesterday [23:13:53] Pchelolo: err yeah a config deployment rather [23:14:05] https://gerrit.wikimedia.org/r/#/c/320529 namely [23:14:18] godog: that shouldbe unrelated, it's for wikidata, we're checking on en.wikipedia [23:15:43] Pchelolo: true, any idea on what it could be? [23:15:59] godog: I'm looking right now, all seems fine.. [23:16:33] I've made the same request checker does via curl - looks ok.. [23:16:58] but the checker script is failing indeed [23:18:32] i dont know who to talk to but i just wanted to make you guys aware that 10.68.17.205 is blocked on enwiki due to a loggout/anom (confirmed) bot however that ip is confirmed from tools labs possible for you guys to investigate? [23:20:54] Pchelolo: what puzzles me the most is why now [23:21:56] Pchelolo maybe it lies in a DB that restbase pulls from? just a thought i have 0 clue about restbase execept that its widely used in MW-related stuff [23:22:20] godog: no idea literally. And I'm not quite sure what's happening, RB responds to curl request, nothing in the logs, we didn't deploy anything related since yesterday morning SD [23:25:27] godog: got it! 
the thumbnail of the page is not returned any more for some reason, but we're expecting it [23:26:00] godog: here we go, some vandalism: https://en.wikipedia.org/w/index.php?title=Barack_Obama&oldid=749031760 [23:26:11] it should fix itself in a bit [23:27:05] ah ah! [23:27:36] once the linksupdatejob will catch up and the pageimage property comes back [23:27:41] false alarm :) [23:28:20] Pchelolo if it happens again pm me i am a rollbacker on english wikipedia [23:29:09] kk Zppix thank you [23:29:15] what a min why was vandalism causing a restbase issue? [23:29:18] wait* [23:30:14] I'll file a task for service checker to be a bit more verbose, not sure how you debugged it though Pchelolo [23:31:02] godog: ye, that message was quite misterious [23:34:28] (03CR) 10Alex Monk: Split check_ssl between traditional year-long certs and LE's 3 month certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293) (owner: 10Alex Monk) [23:36:19] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [23:36:19] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [23:36:19] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [23:36:19] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [23:36:19] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [23:36:20] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [23:36:20] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [23:36:21] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [23:36:21] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [23:36:22] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [23:36:22] RECOVERY - restbase 
endpoints health on restbase2004 is OK: All endpoints are healthy [23:36:23] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [23:36:23] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [23:36:24] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [23:36:31] here we go ^^^ [23:36:39] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [23:36:39] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [23:38:22] 06Operations, 06Operations-Software-Development, 06Services: More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#2789590 (10fgiunchedi) [23:38:28] nice, filed as ^ [23:39:10] 06Operations, 06Operations-Software-Development, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#2789602 (10Pchelolo) [23:47:04] 06Operations, 10Traffic: Extra RTT on TLS handshakes - https://phabricator.wikimedia.org/T150561#2789604 (10BBlack) [23:47:47] (03PS7) 10Alex Monk: Split check_ssl between traditional year-long certs and LE's 3 month certs [puppet] - 10https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293) [23:52:07] 06Operations, 10Traffic: Extra RTT on TLS handshakes - https://phabricator.wikimedia.org/T150561#2789618 (10BBlack)