[00:00:13] twentyafterfour: ^
[00:00:36] upgrade itself will happen tomorrow during CST, but depooling now
[00:00:36] (PS1) Yuvipanda: Make HostPathAutomounter work for files with . in them [debs/kubernetes] - https://gerrit.wikimedia.org/r/343797
[00:00:41] (for eqiad elasticsearch)
[00:00:48] ok
[00:01:05] i guess the timing was good
[00:01:17] what's the advantage to depool it right now instead of during the upgrade?
[00:01:18] re: eqiad depooled for writes
[00:01:21] :)
[00:01:54] we can wait
[00:02:12] thanks Dereckson
[00:02:13] we just need to send it before wmf17
[00:02:14] all working
[00:02:40] mutante: you see some advantage to do it now?
[00:04:31] (PS1) Chad: Nova/wikistatus: Stop pointless page creations and deletions of noisy projects [puppet] - https://gerrit.wikimedia.org/r/343798
[00:04:54] !log mwscript deleteEqualMessages.php on public wikis (T45917)
[00:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:02] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917
[00:05:41] Amir1: ping?
[00:05:48] Dereckson: pong
[00:06:01] (PS2) Dereckson: Enable ORES review tool in etwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343777 (https://phabricator.wikimedia.org/T159609) (owner: Ladsgroup)
[00:06:03] (PS2) Chad: Nova/wikistatus: Stop dumb page creations/deletions of noisy projects [puppet] - https://gerrit.wikimedia.org/r/343798
[00:06:04] Dereckson: no, i don't know. i just meant that i merged the change that stops phab from writing to it and next thing i see eqiad is being depooled
[00:06:14] ok
[00:06:45] Krinkle: Should we set that up as a cron to run every so often?
[00:06:51] (monthly? weekly?)
[00:06:55] dcausse: ebernhardson: yes I guess it's best to do it during as a part of the migration, so you can have a full procedure to test carefully the new unit tests, deploy in the right order, etc.
[00:07:08] RainbowSprinkles: Not sure. For now I'd like to look at the output each time. But I"m not sure what I'm looking for to be honest.
[00:07:14] Not anymore, after running it so many times
[00:07:32] Dereckson: sure, I'll send it tomorrow
[00:07:53] Krinkle: Fair enough, just came to mind is all :)
[00:07:56] (CR) BryanDavis: [C: 1] "untested, but LGTM" [puppet] - https://gerrit.wikimedia.org/r/343798 (owner: Chad)
[00:08:00] RainbowSprinkles: Yeah, definitely.
[00:08:29] (PS6) Dzahn: puppet-lint: ignore 'lines over 140 chars' warnings [puppet] - https://gerrit.wikimedia.org/r/322907
[00:08:40] (CR) Dereckson: [C: 2] Enable ORES review tool in etwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343777 (https://phabricator.wikimedia.org/T159609) (owner: Ladsgroup)
[00:09:09] Amir1: you handle the script part?
[00:09:23] (PS1) Yuvipanda: k8s: Fix typo [puppet] - https://gerrit.wikimedia.org/r/343799
[00:09:25] (PS5) Andrew Bogott: nfs-exportd: Refresh service if script or .yaml changes. [puppet] - https://gerrit.wikimedia.org/r/343459
[00:09:26] Dereckson: sure
[00:09:39] (CR) jerkins-bot: [V: -1] k8s: Fix typo [puppet] - https://gerrit.wikimedia.org/r/343799 (owner: Yuvipanda)
[00:09:41] (PS2) Yuvipanda: k8s: Fix typo [puppet] - https://gerrit.wikimedia.org/r/343799
[00:09:42] thanks mutante the new config looks good
[00:09:53] PROBLEM - puppet last run on argon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:10:06] Okay, zuul is a little slow, we wait for tests.
[00:10:10] twentyafterfour: ok, great
[00:10:11] (Merged) jenkins-bot: Enable ORES review tool in etwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343777 (https://phabricator.wikimedia.org/T159609) (owner: Ladsgroup)
[00:10:19] (CR) jenkins-bot: Enable ORES review tool in etwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343777 (https://phabricator.wikimedia.org/T159609) (owner: Ladsgroup)
[00:10:43] live on mwdebug1002, but as we don't have tables I don't think it's testable?
[00:11:14] Amir1: normally, regular order is (1) db (2) config
[00:11:15] Dereckson: Do you create the tables or I do it?
[00:11:21] you can do it, yes
[00:11:35] I make the tables now
[00:11:45] (CR) Yuvipanda: [V: 2 C: 2] k8s: Fix typo [puppet] - https://gerrit.wikimedia.org/r/343799 (owner: Yuvipanda)
[00:11:47] ok
[00:12:06] (CR) BryanDavis: [C: 1] "Blocks I4e77802b3a634289558ab204cd05a333f6dd950f" [debs/kubernetes] - https://gerrit.wikimedia.org/r/343797 (owner: Yuvipanda)
[00:12:53] RECOVERY - puppet last run on argon is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[00:12:54] by the way, I checked createExtensionTables.php, ores is covered
[00:13:03] (PS1) Yuvipanda: k8s: Fix another typo while eating hat [puppet] - https://gerrit.wikimedia.org/r/343800
[00:13:14] (PS2) Yuvipanda: k8s: Fix another typo while eating hat [puppet] - https://gerrit.wikimedia.org/r/343800
[00:13:20] (CR) Yuvipanda: [V: 2 C: 2] k8s: Fix another typo while eating hat [puppet] - https://gerrit.wikimedia.org/r/343800 (owner: Yuvipanda)
[00:13:40] !log mwscript maintenance/sql.php --wiki=etwiki extensions/ORES/sql/(ores_model|ores_classification).sql (T159609)
[00:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:47] T159609: Deploy ORES review tool to etwiki - https://phabricator.wikimedia.org/T159609
[00:13:50] Dereckson: Yeah, I made that patch
[00:13:53] indeed, https://phabricator.wikimedia.org/rEWMAec7e675fe4880005077c8e1312c133ac09b08855
[00:13:57] but I wasn't sure so I didn't
[00:14:08] I use it next time
[00:14:40] Dereckson: tables are up now
[00:15:06] here, it would have been `mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=etwiki ores`
[00:15:25] Noted
[00:15:27] Thanks
[00:15:36] (or just mwscript extensions/WikimediaMaintenance/createExtensionTables.php etwiki ores)
[00:16:42] So in mwdebug1002, I can see ORES in https://et.wikipedia.org/wiki/Eri:Versioon
[00:16:56] 2 Undefined index: rc_logid in /srv/mediawiki/php-1.29.0-wmf.16/includes/changes/RecentChange.php on line 338
[00:17:14] Amir1: related with your change?
[00:17:28] nope
[00:17:29] (PS6) Andrew Bogott: nfs-exportd: Refresh service if script or .yaml changes. [puppet] - https://gerrit.wikimedia.org/r/343459
[00:18:12] Dereckson: I think there's a task about that index error
[00:18:30] (CR) Dzahn: [C: 2] puppet-lint: ignore 'lines over 140 chars' warnings [puppet] - https://gerrit.wikimedia.org/r/322907 (owner: Dzahn)
[00:18:36] (PS7) Dzahn: puppet-lint: ignore 'lines over 140 chars' warnings [puppet] - https://gerrit.wikimedia.org/r/322907
[00:18:38] Hmm maybe not
[00:18:43] (CR) Dzahn: [V: 2 C: 2] puppet-lint: ignore 'lines over 140 chars' warnings [puppet] - https://gerrit.wikimedia.org/r/322907 (owner: Dzahn)
[00:19:19] https://phabricator.wikimedia.org/search/query/odat0BQeuUWC/#R
[00:23:08] (PS4) Dzahn: apache: Get rid of redirects for non-resolving/parked domains [puppet] - https://gerrit.wikimedia.org/r/285084 (https://phabricator.wikimedia.org/T105981) (owner: Alex Monk)
[00:23:23] Amir1: so?
[00:23:42] I'm running CheckModelVersions.php and it fatals
[00:23:52] ladsgroup@terbium:/srv/mediawiki/php-1.29.0-wmf.16$ mwscript extensions/ORES/maintenance/CheckModelVersions.php --wiki=etwiki
[00:23:52] Starting...Fatal error: Class 'ORES\Api' not found in /srv/mediawiki/php-1.29.0-wmf.16/extensions/ORES/maintenance/CheckModelVersions.php on line 631
[00:25:00] yeah it's only on mwdebug1002 currently, let's sync to terbium too
[00:25:16] done
[00:25:19] Oh, okay
[00:25:20] thanks
[00:25:49] It looks okay to me
[00:26:09] but regarding rc_logid, where can I see it?
[00:26:24] fatalmonitor in fluorine?
[00:26:29] yes
[00:26:51] and according the mwdebug1002 dashboard on Logstash, it's not from this server
[00:26:54] fluorine is dead though :) mwlog*
[00:27:23] `ssh -t mwlog1001.eqiad.wmnet fatalmonitor` could be helpful
[00:27:58] Oh, I need an update
[00:28:37] in mwdebug1002, everything looks okay to
[00:28:47] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable ORES review tool in etwiki (T159609) (duration: 00m 42s)
[00:28:48] good
[00:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:54] T159609: Deploy ORES review tool to etwiki - https://phabricator.wikimedia.org/T159609
[00:30:12] !log ladsgroup@terbium:/srv/mediawiki/php-1.29.0-wmf.16$ mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=etwiki (T159609)
[00:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:30] It works just fine for me
[00:30:30] https://et.wikipedia.org/w/index.php?title=Eri:Viimased_muudatused&hidenondamaging=1
[00:30:34] (enable it)
[00:34:25] (CR) Dzahn: [C: 2] "double checked, they are all dead" [puppet] - https://gerrit.wikimedia.org/r/285084 (https://phabricator.wikimedia.org/T105981) (owner: Alex Monk)
[00:35:23] o/
[00:35:28] So, SWAT is done.
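Editor's sketch, not part of the log: the exchange above settles on creating the extension tables before the config goes live, then running the ORES maintenance scripts. The snippet below is a minimal, hedged illustration that only assembles those command lines as strings, in that order, using the script paths quoted in the conversation. The function name `ores_rollout_plan` and its parameters are hypothetical, nothing is executed, and the config sync step is left out because the log shows only its result, not the command used.

```python
from typing import List

# "mwscript" is the wrapper quoted in the log; this sketch never invokes it.
MWSCRIPT = "mwscript"


def ores_rollout_plan(wiki: str, extension: str = "ores") -> List[str]:
    """Return the maintenance commands in the order discussed in the log.

    Hypothetical helper: it formats strings only and runs nothing.
    """
    if not wiki or any(ch.isspace() for ch in wiki):
        raise ValueError("wiki must be a non-empty name without whitespace")
    return [
        # 1. Create the extension tables first ("regular order is (1) db (2) config").
        f"{MWSCRIPT} extensions/WikimediaMaintenance/createExtensionTables.php "
        f"--wiki={wiki} {extension}",
        # 2. After the code and config are synced, check the model versions.
        f"{MWSCRIPT} extensions/ORES/maintenance/CheckModelVersions.php --wiki={wiki}",
        # 3. Finally populate the database for the wiki.
        f"{MWSCRIPT} extensions/ORES/maintenance/PopulateDatabase.php --wiki={wiki}",
    ]


if __name__ == "__main__":
    for step, command in enumerate(ores_rollout_plan("etwiki"), start=1):
        print(f"{step}. {command}")
```

Keeping the plan as plain strings makes the intended order reviewable before anyone runs the commands by hand on a maintenance host.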
[00:36:30] Dereckson: Thanks
[00:37:15] (CR) Dzahn: "applied on mwdebug1001 first, no issues" [puppet] - https://gerrit.wikimedia.org/r/285084 (https://phabricator.wikimedia.org/T105981) (owner: Alex Monk)
[00:37:16] !log ladsgroup@terbium:/srv/mediawiki/php-1.29.0-wmf.16$ mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=etwiki is done now (T159609)
[00:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:22] T159609: Deploy ORES review tool to etwiki - https://phabricator.wikimedia.org/T159609
[00:41:09] (CR) Dzahn: [C: 1] Move some production apache config files to templates [puppet] - https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: Alex Monk)
[00:52:53] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:04:31] (PS1) Smalyshev: Add Blazegraph options for proxy config [puppet] - https://gerrit.wikimedia.org/r/343803 (https://phabricator.wikimedia.org/T160969)
[01:05:31] (CR) jerkins-bot: [V: -1] Add Blazegraph options for proxy config [puppet] - https://gerrit.wikimedia.org/r/343803 (https://phabricator.wikimedia.org/T160969) (owner: Smalyshev)
[01:06:47] (PS2) Smalyshev: Add Blazegraph options for proxy config [puppet] - https://gerrit.wikimedia.org/r/343803 (https://phabricator.wikimedia.org/T160969)
[01:21:53] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[01:55:13] PROBLEM - puppet last run on dbmonitor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:10:53] (CR) Krinkle: "Follows-up 86d5293b80a5cfac. They used to work. The canonical one is wikimediastories.org and is linked from various offline and online pu" [puppet] - https://gerrit.wikimedia.org/r/285084 (https://phabricator.wikimedia.org/T105981) (owner: Alex Monk)
[02:11:01] mutante: ^
[02:11:13] They are only a dead-end because you made them so a while back when removing non-SSL domains..
[02:11:21] these aren't non-canonical project redirects though
[02:11:37] perhaps in scope for Let's Encrypt to restore?
[02:11:46] even without HTTPS seems fair to restore imho
[02:11:51] at leasts the "canonical" one
[02:12:40] org/net/com for wikimediastories T82390
[02:12:55] (CR) Krinkle: "(T82390)" [puppet] - https://gerrit.wikimedia.org/r/285084 (https://phabricator.wikimedia.org/T105981) (owner: Alex Monk)
[02:13:40] did someone pick that work up?
[02:14:45] (PS1) Nuria: Blacklisting ImageMetrics schemas [puppet] - https://gerrit.wikimedia.org/r/343809 (https://phabricator.wikimedia.org/T141407)
[02:18:51] (CR) Krinkle: Blacklisting ImageMetrics schemas (1 comment) [puppet] - https://gerrit.wikimedia.org/r/343809 (https://phabricator.wikimedia.org/T141407) (owner: Nuria)
[02:21:51] (PS2) Nuria: Blacklisting ImageMetrics schemas [puppet] - https://gerrit.wikimedia.org/r/343809 (https://phabricator.wikimedia.org/T141407)
[02:24:13] RECOVERY - puppet last run on dbmonitor1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:32:11] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.16) (duration: 13m 22s)
[02:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:37:34] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Mar 21 02:37:33 UTC 2017 (duration 5m 22s)
[02:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:43:03] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[02:44:53] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.411 second response time
[02:52:03] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:54:53] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.663 second response time
[03:10:29] (CR) Krinkle: [C: 1] Blacklisting ImageMetrics schemas [puppet] - https://gerrit.wikimedia.org/r/343809 (https://phabricator.wikimedia.org/T141407) (owner: Nuria)
[03:51:54] (PS7) Andrew Bogott: nfs-exportd: Refresh service if script or .yaml changes. [puppet] - https://gerrit.wikimedia.org/r/343459
[03:57:04] (CR) Andrew Bogott: [C: 2] Nova/wikistatus: Stop dumb page creations/deletions of noisy projects [puppet] - https://gerrit.wikimedia.org/r/343798 (owner: Chad)
[03:57:09] (PS3) Andrew Bogott: Nova/wikistatus: Stop dumb page creations/deletions of noisy projects [puppet] - https://gerrit.wikimedia.org/r/343798 (owner: Chad)
[04:20:14] (CR) Andrew Bogott: "Note that this probably won't take effect until I explicitly restart a bunch of nova-services, which is a little bit messy. You might hav" [puppet] - https://gerrit.wikimedia.org/r/343798 (owner: Chad)
[04:24:13] (PS4) Andrew Bogott: Bootstrapvz: Simplify and update [puppet] - https://gerrit.wikimedia.org/r/343208
[04:24:15] (PS5) Andrew Bogott: Keystonehooks: Exclude 'novaobserver' user from posix user group. [puppet] - https://gerrit.wikimedia.org/r/343074 (https://phabricator.wikimedia.org/T158650)
[04:24:17] (PS2) Andrew Bogott: Designate: Don't use keystone to resolve project id [puppet] - https://gerrit.wikimedia.org/r/343356 (https://phabricator.wikimedia.org/T158650)
[04:30:59] (PS1) Krinkle: errorpages: Restyle 404.php to be like other error pages [mediawiki-config] - https://gerrit.wikimedia.org/r/343819 (https://phabricator.wikimedia.org/T113114)
[04:32:46] (PS2) Krinkle: errorpages: Restyle 404.php to be like other error pages [mediawiki-config] - https://gerrit.wikimedia.org/r/343819 (https://phabricator.wikimedia.org/T113114)
[04:35:18] (PS3) Krinkle: errorpages: Restyle 404.php to be like other error pages [mediawiki-config] - https://gerrit.wikimedia.org/r/343819 (https://phabricator.wikimedia.org/T113114)
[04:58:06] Operations, MediaWiki-Configuration, Performance-Team, Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3116897 (aaron) Something like that seems fine, though at first glance it doesn't seem to leve...
[05:03:39] (CR) MZMcBride: "This change is fine." [mediawiki-config] - https://gerrit.wikimedia.org/r/343819 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[05:06:09] (CR) VolkerE: [C: -1] errorpages: Restyle 404.php to be like other error pages (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/343819 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[05:07:43] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:11:30] (CR) Krinkle: errorpages: Restyle 404.php to be like other error pages (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/343819 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[05:19:51] (PS1) Krinkle: errorpages: Sync with changes to puppet:///varnish/errorpages.html [mediawiki-config] - https://gerrit.wikimedia.org/r/343821
[05:20:07] (CR) Krinkle: "Per https://github.com/wikimedia/puppet/commits/production/modules/varnish/files/errorpage.html" [mediawiki-config] - https://gerrit.wikimedia.org/r/343821 (owner: Krinkle)
[05:20:29] (PS4) Krinkle: errorpages: Restyle 404.php to be like other error pages [mediawiki-config] - https://gerrit.wikimedia.org/r/343819 (https://phabricator.wikimedia.org/T113114)
[05:21:40] (PS5) Krinkle: errorpages: Restyle 404.php to be like other error pages [mediawiki-config] - https://gerrit.wikimedia.org/r/343819 (https://phabricator.wikimedia.org/T113114)
[05:24:54] (CR) Gergő Tisza: Blacklisting ImageMetrics schemas (1 comment) [puppet] - https://gerrit.wikimedia.org/r/343809 (https://phabricator.wikimedia.org/T141407) (owner: Nuria)
[05:35:53] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[05:53:14] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:55:06] !log joal@tin Started deploy [analytics/refinery@c3a9139]: (no justification provided)
[05:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:45] !log joal@tin Finished deploy [analytics/refinery@c3a9139]: (no justification provided) (duration: 06m 39s)
[06:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:53] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[06:39:53] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[06:41:55] (PS1) Marostegui: Revert "db-codfw.php: Depool db2051" [mediawiki-config] - https://gerrit.wikimedia.org/r/343822
[06:41:59] (PS2) Marostegui: Revert "db-codfw.php: Depool db2051" [mediawiki-config] - https://gerrit.wikimedia.org/r/343822
[06:45:12] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2051" [mediawiki-config] - https://gerrit.wikimedia.org/r/343822 (owner: Marostegui)
[06:48:04] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2051" [mediawiki-config] - https://gerrit.wikimedia.org/r/343822 (owner: Marostegui)
[06:48:13] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2051" [mediawiki-config] - https://gerrit.wikimedia.org/r/343822 (owner: Marostegui)
[06:49:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:49:27] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2051 - T160415 - T73563 (duration: 01m 07s)
[06:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:33] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415
[06:49:33] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563
[06:52:53] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[06:53:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[06:58:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:01:03] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:07:13] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:09:38] (PS1) Marostegui: db-codfw.php: Depool db2044 [mediawiki-config] - https://gerrit.wikimedia.org/r/343823 (https://phabricator.wikimedia.org/T160415)
[07:15:28] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2044 [mediawiki-config] - https://gerrit.wikimedia.org/r/343823 (https://phabricator.wikimedia.org/T160415) (owner: Marostegui)
[07:16:49] (Merged) jenkins-bot: db-codfw.php: Depool db2044 [mediawiki-config] - https://gerrit.wikimedia.org/r/343823 (https://phabricator.wikimedia.org/T160415) (owner: Marostegui)
[07:17:40] (CR) jenkins-bot: db-codfw.php: Depool db2044 [mediawiki-config] - https://gerrit.wikimedia.org/r/343823 (https://phabricator.wikimedia.org/T160415) (owner: Marostegui)
[07:18:02] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2044 - T160415 - T73563 (duration: 00m 41s)
[07:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:08] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415
[07:18:08] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563
[07:18:09] !log Deploy schema change on db2044 and labsdb1009 (s4) - https://phabricator.wikimedia.org/T160415 - https://phabricator.wikimedia.org/T73563
[07:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:15] !log Run pt-table-checksum on s6 (jawiki) - T160509
[07:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:22] T160509: run pt-tablechecksum on s6 - https://phabricator.wikimedia.org/T160509
[07:25:17] (PS1) Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/343824 (https://phabricator.wikimedia.org/T137191)
[07:25:53] PROBLEM - puppet last run on restbase1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:29:53] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[07:30:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[07:32:04] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/343824 (https://phabricator.wikimedia.org/T137191) (owner: Marostegui)
[07:33:36] (Merged) jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/343824 (https://phabricator.wikimedia.org/T137191) (owner: Marostegui)
[07:33:47] (CR) jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/343824 (https://phabricator.wikimedia.org/T137191) (owner: Marostegui)
[07:35:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1092 - T137191 (duration: 00m 42s)
[07:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:10] T137191: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191
[07:35:13] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[07:36:20] !log Stop mysql db1092 for maintenance - T137191
[07:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:54] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:46:20] Operations, ops-codfw, DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3117015 (Marostegui) I see the server is still down, @Papaul did the technician finally show up yesterday?
[07:50:07] !log banning elastic2020 from cluster to investigate T149006
[07:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:13] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006
[07:52:25] (CR) Muehlenhoff: [C: 1] "Looks good; in production netmon1001 is our only remaining precise system. I'm not 100% sure if any of the remaining precise instances in " [puppet] - https://gerrit.wikimedia.org/r/342008 (https://phabricator.wikimedia.org/T157131) (owner: Muehlenhoff)
[07:53:53] RECOVERY - puppet last run on restbase1012 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[07:54:08] Operations, MediaWiki-Configuration, Performance-Team, Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3117017 (Joe) >>! In T156924#3116897, @aaron wrote: > Something like that seems fine, though a...
[07:56:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:57:03] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:59:13] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:00:43] (PS4) Gilles: Enable memcache-based Thumbor broken thumbnail throttling [puppet] - https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065)
[08:01:01] Operations, Graphite, Performance-Team: Increase Grafana user rights for Performance team members - https://phabricator.wikimedia.org/T160738#3117018 (Peter) thank you @Krinkle I'll cleanup all alerts now.
[08:01:44] (CR) Giuseppe Lavagetto: Add a stage to switch datacenter (4 comments) [switchdc] - https://gerrit.wikimedia.org/r/343537 (owner: Giuseppe Lavagetto)
[08:05:42] Operations, Puppet, Patch-For-Review: Update puppet-lint to 2.* - https://phabricator.wikimedia.org/T144667#3117019 (hashar) Got merged as well: https://gerrit.wikimedia.org/r/#/c/342637/ - bump version Welcome to puppet-lint 2.x
[08:05:51] (PS1) Ema: tlsproxy: enable Lua support by default [puppet] - https://gerrit.wikimedia.org/r/343827
[08:06:53] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[08:07:53] (PS6) Gehel: postgresql - require postgresql / postgis packages for spatialdb [puppet] - https://gerrit.wikimedia.org/r/343088
[08:09:19] (CR) Gehel: [C: 2] postgresql - require postgresql / postgis packages for spatialdb [puppet] - https://gerrit.wikimedia.org/r/343088 (owner: Gehel)
[08:09:22] Operations, Graphite, Performance-Team: Increase Grafana user rights for Performance team members - https://phabricator.wikimedia.org/T160738#3117021 (Peter) I tested now, and I couldn't remove the history but in admin I permissions of a Grafana user. But the role was editor, so I changed to admin...
[08:09:53] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[08:10:23] (PS4) DCausse: [es5 upgrade] step 3: depool eqiad for writes (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/343665 (https://phabricator.wikimedia.org/T157479)
[08:10:39] (CR) Ema: "Alternatively, we could remove the cache::lua_support hiera attribute altogether as I don't see any particular reason to disable Lua suppo" [puppet] - https://gerrit.wikimedia.org/r/343827 (owner: Ema)
[08:11:58] (PS2) Gehel: maps - tuning of postgresql based on experience [puppet] - https://gerrit.wikimedia.org/r/343001 (https://phabricator.wikimedia.org/T160556)
[08:13:38] (CR) Gehel: [C: 2] maps - tuning of postgresql based on experience [puppet] - https://gerrit.wikimedia.org/r/343001 (https://phabricator.wikimedia.org/T160556) (owner: Gehel)
[08:22:34] (CR) Muehlenhoff: [C: -1] Fix some Debian lintian warnnings for the gerrit package (1 comment) [debs/gerrit] - https://gerrit.wikimedia.org/r/343297 (owner: Paladox)
[08:23:49] (PS2) Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718)
[08:26:47] (PS1) Phuedx: pagePreviews: Increase perf instrumentation sample [mediawiki-config] - https://gerrit.wikimedia.org/r/343828
[08:27:13] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[08:29:03] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:43:43] !log shutting down elasticsearch on elastic2020, investigating T149006
[08:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:50] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006
[08:48:33] PROBLEM - puppet last run on elastic2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:50:40] (CR) Volans: [C: -1] "additional comments inline" (2 comments) [switchdc] - https://gerrit.wikimedia.org/r/343537 (owner: Giuseppe Lavagetto)
[08:53:33] (CR) Gilles: "PCC build passes: https://puppet-compiler.wmflabs.org/5846/" [puppet] - https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065) (owner: Gilles)
[08:56:56] Operations, ops-codfw, DC-Ops, Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3117073 (Gehel) Open>Resolved Running bonnie++ as documented on T153083#2886085 to see if I/O stress as an influence on stability.
[08:57:05] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[08:58:51] (CR) Giuseppe Lavagetto: Add a stage to switch datacenter (2 comments) [switchdc] - https://gerrit.wikimedia.org/r/343537 (owner: Giuseppe Lavagetto)
[09:01:22] Operations, Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#3117086 (ArielGlenn) So @chasemp, can we get with Rob and get these boxes ordered?
[09:05:34] I am going to potentially break Zuul / CI entirely
[09:05:51] I am deploying a change that prioritize operations/puppet.git jenkins job
[09:06:33] !log CI deploying config hack "High priority test pipeline" : https://gerrit.wikimedia.org/r/343318 - T160667
[09:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:38] T160667: Create "High Priority" test pipeline - https://phabricator.wikimedia.org/T160667
[09:07:57] (CR) Gehel: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/343803 (https://phabricator.wikimedia.org/T160969) (owner: Smalyshev)
[09:09:03] !log installing wireshark security updates
[09:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:23] RECOVERY - bacula sd process on helium is OK: PROCS OK: 1 process with UID = 110 (bacula), command name bacula-sd
[09:11:03] PROBLEM - DPKG on pybal-test2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:11:03] PROBLEM - DPKG on pybal-test2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:11:03] PROBLEM - DPKG on tungsten is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:11:13] PROBLEM - DPKG on ununpentium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:11:23] PROBLEM - DPKG on pybal-test2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:11:23] PROBLEM - DPKG on zosma is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:11:24] PROBLEM - DPKG on multatuli is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:11:53] RECOVERY - bacula director process on helium is OK: PROCS OK: 1 process with UID = 110 (bacula), command name bacula-dir
[09:12:03] PROBLEM - DPKG on ruthenium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:12:22] having a look
[09:13:23] RECOVERY - DPKG on pybal-test2002 is OK: All packages OK
[09:13:33] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tshark]
[09:14:03] RECOVERY - DPKG on pybal-test2003 is OK: All packages OK
[09:14:03] RECOVERY - DPKG on pybal-test2001 is OK: All packages OK
[09:14:16] (PS5) DCausse: [es5 upgrade] step 3: depool eqiad for writes (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/343665 (https://phabricator.wikimedia.org/T157479)
[09:14:53] Operations, Ops-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T160980#3117127 (GoranSMilovanovic)
[09:14:54] !log enable bacula deamons on helium, everything looks ok
[09:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:03] RECOVERY - DPKG on tungsten is OK: All packages OK
[09:15:24] RECOVERY - DPKG on zosma is OK: All packages OK
[09:15:24] RECOVERY - DPKG on multatuli is OK: All packages OK
[09:17:03] RECOVERY - DPKG on ruthenium is OK: All packages OK
[09:17:13] RECOVERY - DPKG on ununpentium is OK: All packages OK
[09:18:23] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tshark]
[09:25:58] (PS5) Giuseppe Lavagetto: Add a stage to switch datacenter [switchdc] - https://gerrit.wikimedia.org/r/343537
[09:26:24] (CR) jerkins-bot: [V: -1] Add a stage to switch datacenter [switchdc] - https://gerrit.wikimedia.org/r/343537 (owner: Giuseppe Lavagetto)
[09:27:20] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:27:34] (CR) Muehlenhoff: "Looks good; in production netmon1001 is our only remaining precise system. I'm not 100% sure if any of the remaining precise instances in " [puppet] - https://gerrit.wikimedia.org/r/343309 (https://phabricator.wikimedia.org/T158652) (owner: Hashar)
[09:28:06] (PS2) Phuedx: pagePreviews: Increase perf instrumentation sample [mediawiki-config] - https://gerrit.wikimedia.org/r/343828 (https://phabricator.wikimedia.org/T157111)
[09:28:53] (CR) Joal: [C: -1] "Minor things, and a typo :)" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/343753 (https://phabricator.wikimedia.org/T160083) (owner: Ottomata)
[09:34:30] (CR) Joal: [C: -1] "Another thing: It means we need to deploy refinery on analytics1003 as well!" [puppet] - https://gerrit.wikimedia.org/r/343753 (https://phabricator.wikimedia.org/T160083) (owner: Ottomata)
[09:35:00] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.213 second response time
[09:36:40] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[09:40:00] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.409 second response time
[09:42:23] !log Stop MySQL db1070 to clone db1092 from it - T137191
[09:42:30] !log installing libevent security updates on remaining hosts in eqiad
[09:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:31] T137191: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191
[09:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:40] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[09:47:24] (PS6) DCausse: [es5 upgrade] step 3: depool eqiad for writes (take 2)
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/343665 (https://phabricator.wikimedia.org/T157479) [09:47:26] (03PS1) 10DCausse: [es5 upgrade] Enable completion suggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343835 [09:51:35] (03PS2) 10Filippo Giunchedi: hieradata: add prometheus200[34] [puppet] - 10https://gerrit.wikimedia.org/r/343693 (https://phabricator.wikimedia.org/T148408) [09:54:49] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: add prometheus200[34] [puppet] - 10https://gerrit.wikimedia.org/r/343693 (https://phabricator.wikimedia.org/T148408) (owner: 10Filippo Giunchedi) [09:55:20] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [10:04:01] (03PS6) 10Volans: Add a stage to switch datacenter [switchdc] - 10https://gerrit.wikimedia.org/r/343537 (owner: 10Giuseppe Lavagetto) [10:05:59] (03PS1) 10Filippo Giunchedi: hieradata: use ms-fe100[5-8] as swift memcache [puppet] - 10https://gerrit.wikimedia.org/r/343837 (https://phabricator.wikimedia.org/T155095) [10:07:12] 06Operations, 15User-fgiunchedi: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020#3117300 (10fgiunchedi) a:05akosiaris>03fgiunchedi [10:10:07] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: use ms-fe100[5-8] as swift memcache [puppet] - 10https://gerrit.wikimedia.org/r/343837 (https://phabricator.wikimedia.org/T155095) (owner: 10Filippo Giunchedi) [10:16:41] (03CR) 10Volans: [C: 032] Add a stage to switch datacenter [switchdc] - 10https://gerrit.wikimedia.org/r/343537 (owner: 10Giuseppe Lavagetto) [10:23:10] (03PS1) 10Filippo Giunchedi: swift: decom ms-fe100[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/343840 (https://phabricator.wikimedia.org/T155095) [10:23:29] 06Operations, 10Wikidata, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Kabiye - https://phabricator.wikimedia.org/T160868#3117377 (10Dereckson) [10:24:37] 06Operations, 10Ops-Access-Requests: Requesting 
access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T160980#3117379 (10GoranSMilovanovic) [10:24:45] (03PS5) 10Addshore: Use wmgUseInterwikiSorting for labs from prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341036 [10:25:01] 06Operations, 10Wikidata, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Kabiye - https://phabricator.wikimedia.org/T160868#3113180 (10Dereckson) 05Open>03stalled @MF-Warburg Wikimedia language engineering isn't currently happy with translation progress: they would like 100 more messag... [10:26:24] (03CR) 10Filippo Giunchedi: [C: 032] swift: decom ms-fe100[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/343840 (https://phabricator.wikimedia.org/T155095) (owner: 10Filippo Giunchedi) [10:28:27] 06Operations, 10Ops-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T160980#3117127 (10Dereckson) @GoranSMilovanovic RESOURCE and USER[S] are placeholders to be replaced by real data [10:29:48] 06Operations, 10Ops-Access-Requests: Requesting access to RESOURCE for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3117401 (10GoranSMilovanovic) [10:30:00] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:30:18] (03PS1) 10Filippo Giunchedi: fix spare::system [puppet] - 10https://gerrit.wikimedia.org/r/343843 [10:31:27] 06Operations, 10Ops-Access-Requests: Requesting access to RESOURCE for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3117127 (10GoranSMilovanovic) @Dereckson Sorry, still working on it... figuring out with @Addshore what RESOURCE[S} will I need... 
[10:33:07] !log Run pt-table-checksum on s6 (ruwiki) - https://phabricator.wikimedia.org/T160509 [10:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:41] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] fix spare::system [puppet] - 10https://gerrit.wikimedia.org/r/343843 (owner: 10Filippo Giunchedi) [10:35:00] RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [10:37:00] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [10:37:10] (03CR) 10Addshore: "@Muehlenhoff did you come to any decision over this while you were generally looking at ldap groups?" [puppet] - 10https://gerrit.wikimedia.org/r/333024 (owner: 10Addshore) [10:38:38] 06Operations, 10hardware-requests: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3117417 (10fgiunchedi) [10:38:45] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: Rack and set up ms-fe100[5-8] - https://phabricator.wikimedia.org/T155095#3117428 (10fgiunchedi) 05Open>03Resolved Done [10:39:46] (03PS1) 10Ema: varnish: swap around backend ttl cap and keep values [1/2] [puppet] - 10https://gerrit.wikimedia.org/r/343844 [10:39:48] (03PS1) 10Ema: varnish: swap around backend ttl cap and keep values [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/343845 (https://phabricator.wikimedia.org/T124954) [10:40:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [10:40:13] 06Operations, 10hardware-requests, 15User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3117417 (10fgiunchedi) [10:40:33] (03CR) 10Muehlenhoff: [C: 031] "This looks fine to me, but it is somewhat comparable to an access request, so let me raise this in the next Ops meeting (next Monday) and " [puppet] - 
10https://gerrit.wikimedia.org/r/333024 (owner: 10Addshore) [10:41:04] (03PS2) 10Ema: varnish: swap around backend ttl cap and keep values [1/2] [puppet] - 10https://gerrit.wikimedia.org/r/343844 (https://phabricator.wikimedia.org/T124954) [10:41:59] (03PS2) 10Ema: varnish: swap around backend ttl cap and keep values [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/343845 (https://phabricator.wikimedia.org/T124954) [10:57:58] 06Operations, 10Pybal, 10Traffic: Upgrade twisted on load balancers to 16.2.0 - https://phabricator.wikimedia.org/T160433#3117465 (10ema) I've been testing twisted 16.2.0 on pybal-test2001 for a while and the various monitoring protocols look good. I'm going to upgrade twisted on lvs1007-12 as a next step. [11:00:18] !log upgrading twisted to 16.2.0 on lvs1007-12 T160433 [11:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:24] T160433: Upgrade twisted on load balancers to 16.2.0 - https://phabricator.wikimedia.org/T160433 [11:13:53] 06Operations, 10Wikidata, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Kabiye - https://phabricator.wikimedia.org/T160868#3117486 (10Dereckson) [11:34:35] 06Operations, 10Ops-Access-Requests: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3117511 (10GoranSMilovanovic) [11:35:00] 06Operations, 10Ops-Access-Requests: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3117127 (10GoranSMilovanovic) @Dereckson Done. Let me know if the ticket is Ok now, please. Thanks. [11:38:58] 06Operations, 10Ops-Access-Requests: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3117516 (10Dereckson) @GoranSMilovanovic You need to follow https://wikitech.wikimedia.org/wiki/Production_shell_access#Requesting_access, so a... 
[11:40:56] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2044" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343850 [11:40:59] 06Operations, 10Ops-Access-Requests: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3117517 (10GoranSMilovanovic) @Dereckson Yes I have signed L3 from my Wikitech account. [11:41:00] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2044" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343850 [11:49:14] PROBLEM - puppet last run on elastic1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:50:46] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtest - https://phabricator.wikimedia.org/T154706#3117556 (10chasemp) >>! In T154706#3116218, @RobH wrote: > @chasemp: > > Is there a specific existing server that meets this requirement to base a new spec off of? > There... [11:50:55] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtest - https://phabricator.wikimedia.org/T154706#3117562 (10chasemp) [11:54:27] 06Operations, 10Ops-Access-Requests: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3117127 (10MoritzMuehlenhoff) Hi Goran, your access needs an NDA. What's your @wikimedia.de email address (I couldn't find you on https://www.w... [11:54:29] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labcontrol1003/1004 - https://phabricator.wikimedia.org/T158207#3117577 (10chasemp) >>! In T158207#3116228, @RobH wrote: > Is there a specific cpu seed we have to stick to? 24 cores without HT is dual 12 core CPUs. Anything b... [11:57:41] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labnet1003/1004 - https://phabricator.wikimedia.org/T158204#3117579 (10chasemp) >>! 
In T158204#3116230, @RobH wrote: > Is there a specific cpu seed we have to stick to? 24 cores without HT is dual 12 core CPUs. Anything between... [11:59:38] 06Operations, 10Ops-Access-Requests: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3117580 (10GoranSMilovanovic) Hi Moritz, I understand that the request needs an NDA, but I still do not have a @wikimedia.de email address; co... [12:00:40] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#3117581 (10chasemp) >>! In T118154#3117086, @ArielGlenn wrote: > So @chasemp, can we get with Rob and get these boxes ordered? +1 thank you [12:18:14] RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [12:27:43] !log Create account Superzerocool on projectcomwiki (bureaucrat, T143138) [12:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:50] T143138: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138 [12:28:58] heading out to grab a snack, will be back for swat [12:29:04] jouncebot: next [12:29:04] In 0 hour(s) and 30 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170321T1300) [12:34:14] !log Created OATHAuth tables on projectcomwiki (T143138) [12:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:21] T143138: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138 [12:40:50] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labnet1003/1004 - https://phabricator.wikimedia.org/T158204#3117643 (10chasemp) 05Open>03stalled Let's hold on this one out of the pending 3 for last, I want to do some more review on CPU specs since the existing is such a... 
[12:42:32] (03PS1) 10Volans: Add MediaWiki config tasks for ro/rw mode [switchdc] - 10https://gerrit.wikimedia.org/r/343858 (https://phabricator.wikimedia.org/T160178) [12:43:56] (03PS1) 10Volans: Uniform maintenance message and indentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343859 (https://phabricator.wikimedia.org/T160178) [12:46:00] (03PS2) 10Volans: Add varnish text caches task [switchdc] - 10https://gerrit.wikimedia.org/r/343629 (https://phabricator.wikimedia.org/T160178) [12:46:24] PROBLEM - DPKG on auth2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:46:24] PROBLEM - puppet last run on auth1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tshark] [12:47:04] PROBLEM - DPKG on auth1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:47:36] !log running stress and bonnie on elastic2020 - T149006 [12:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:43] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006 [12:49:04] RECOVERY - DPKG on auth1001 is OK: All packages OK [12:49:24] RECOVERY - DPKG on auth2001 is OK: All packages OK [12:49:26] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3117661 (10Gehel) stress is launched with `stress --cpu 28 --vm 4` [12:50:46] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I would uniform on 3 minutes for the message; otherwise it LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343859 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [12:51:27] (03PS3) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) [12:51:44] (03CR) 10Jcrespo: "Yes, it was only set to 15 
minutes for the latest datacenter switch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343859 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [12:53:45] (03CR) 10Volans: "Ok, although I needed to know the actual value for when we will make the DC switchover. I'll change it back to 3 minutes for now anyway." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343859 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [12:54:43] (03PS2) 10Volans: Uniform maintenance message and indentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343859 (https://phabricator.wikimedia.org/T160178) [12:54:45] !log installing r-base security updates [12:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:23] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Add a batch size to the puppet run. Or verify if can handle more than 100 concurrent puppet runs." (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/343629 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [12:59:13] (03PS3) 10Volans: Add varnish text caches task [switchdc] - 10https://gerrit.wikimedia.org/r/343629 (https://phabricator.wikimedia.org/T160178) [12:59:54] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.473 second response time [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170321T1300). [13:00:04] dcausse, addshore, and phuedx: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[13:00:16] I can swat [13:00:54] checked the patches and I think I can do it if there are no objections [13:01:06] o/ [13:01:09] dcausse: epic! [13:01:12] :) [13:01:22] o/ [13:01:26] o/ [13:01:28] dcausse: godspeed [13:02:03] hashar: no objections? I'll send addshore and phuedx's patches first then mines [13:02:11] all good to me :} [13:02:36] you guys know what you are doing more than I . I will idle around to provide support for the scap / git part [13:03:39] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341036 (owner: 10Addshore) [13:04:48] (03Merged) 10jenkins-bot: Use wmgUseInterwikiSorting for labs from prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341036 (owner: 10Addshore) [13:04:54] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.110 second response time [13:05:13] (03CR) 10jenkins-bot: Use wmgUseInterwikiSorting for labs from prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341036 (owner: 10Addshore) [13:05:45] addshore: I think you're all set [13:06:44] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "One comment, a couple of minor observations." 
(033 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/343858 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:06:57] dcausse: looks all good on beta [13:07:14] phuedx: you're next [13:07:22] (03CR) 10Ottomata: Load wiki project namespace map into HDFS weekly, sqoop mediawiki monthly (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/343753 (https://phabricator.wikimedia.org/T160083) (owner: 10Ottomata) [13:07:23] * phuedx runs for the hills [13:07:30] (03CR) 10Giuseppe Lavagetto: [C: 032] Add varnish text caches task [switchdc] - 10https://gerrit.wikimedia.org/r/343629 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:07:30] dcausse: sure [13:07:43] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343828 (https://phabricator.wikimedia.org/T157111) (owner: 10Phuedx) [13:09:17] (03Merged) 10jenkins-bot: pagePreviews: Increase perf instrumentation sample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343828 (https://phabricator.wikimedia.org/T157111) (owner: 10Phuedx) [13:09:26] (03CR) 10jenkins-bot: pagePreviews: Increase perf instrumentation sample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343828 (https://phabricator.wikimedia.org/T157111) (owner: 10Phuedx) [13:09:52] (03PS3) 10Gehel: Add Blazegraph options for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/343803 (https://phabricator.wikimedia.org/T160969) (owner: 10Smalyshev) [13:10:54] (03PS2) 10Ottomata: Load wiki project namespace map into HDFS weekly, sqoop mediawiki monthly [puppet] - 10https://gerrit.wikimedia.org/r/343753 (https://phabricator.wikimedia.org/T160083) [13:11:18] (03CR) 10Gehel: [C: 032] Add Blazegraph options for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/343803 (https://phabricator.wikimedia.org/T160969) (owner: 10Smalyshev) [13:11:29] (03PS2) 10Volans: Add MediaWiki config tasks for ro/rw mode [switchdc] - 10https://gerrit.wikimedia.org/r/343858 
(https://phabricator.wikimedia.org/T160178) [13:11:39] phuedx: live on mwdebug1002, please test if possible [13:11:45] dcausse: ta [13:12:35] (03CR) 10jerkins-bot: [V: 04-1] Load wiki project namespace map into HDFS weekly, sqoop mediawiki monthly [puppet] - 10https://gerrit.wikimedia.org/r/343753 (https://phabricator.wikimedia.org/T160083) (owner: 10Ottomata) [13:12:45] !log Clear centralauth for RickinBaltimore per T160671 [13:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:51] T160671: Need help with 2FA - https://phabricator.wikimedia.org/T160671 [13:13:11] (03CR) 10Volans: "replies inline" (032 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/343858 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:13:14] (03CR) 10Giuseppe Lavagetto: [C: 031] "I would be more confident in the code if we had unit tests for this part :)" [switchdc] - 10https://gerrit.wikimedia.org/r/343633 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:13:41] dcausse: 👍 value's up to date [13:13:46] (03PS4) 10Volans: Add varnish text caches task [switchdc] - 10https://gerrit.wikimedia.org/r/343629 (https://phabricator.wikimedia.org/T160178) [13:13:55] phuedx: thanks, will sync [13:14:07] there's not much to test, i just need to keep an eye on https://grafana.wikimedia.org/dashboard/db/graphite-eqiad for the next couple of hours [13:14:15] !log Make that clear 2FA for RickinBaltimore per T160671 [13:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:24] RECOVERY - puppet last run on auth1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [13:15:15] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: T157111 pagePreviews: Increase perf instrumentation sample (duration: 00m 58s) [13:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:21] T157111: [8 hours] Create a Page Previews performance 
dashboard - https://phabricator.wikimedia.org/T157111 [13:15:29] thanks dcausse [13:15:34] phuedx: yw! [13:16:37] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343665 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [13:20:47] (03PS7) 10DCausse: [es5 upgrade] step 3: depool eqiad for writes (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343665 (https://phabricator.wikimedia.org/T157479) [13:22:32] (03CR) 10DCausse: [C: 032] "SWAT take 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343665 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [13:23:41] (03PS1) 10Gehel: wdqs - correct quotes in default configuration [puppet] - 10https://gerrit.wikimedia.org/r/343863 [13:23:58] (03Merged) 10jenkins-bot: [es5 upgrade] step 3: depool eqiad for writes (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343665 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [13:24:15] (03CR) 10jenkins-bot: [es5 upgrade] step 3: depool eqiad for writes (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343665 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [13:25:52] (03CR) 10Gehel: [C: 032] wdqs - correct quotes in default configuration [puppet] - 10https://gerrit.wikimedia.org/r/343863 (owner: 10Gehel) [13:27:37] !log dcausse@tin Synchronized wmf-config/CommonSettings.php: [es5 upgrade] step 3: depool eqiad for writes (take 2) (1/3) (duration: 00m 42s) [13:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:03] 06Operations, 10Domains, 10Traffic, 06WMF-Legal, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3117737 (10Beetlebeard) >>! In T158638#3110288, @Dzahn wrote: > @Kaarel_Vaidla @Beetlebeard I can make the change but i wanted to check with... 
[13:29:12] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: [es5 upgrade] step 3: depool eqiad for writes (take 2) (2/3) (duration: 00m 43s) [13:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:35] !log rolling restart of wdqs to load new configuration options [13:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:54] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.407 second response time [13:30:41] !log dcausse@tin Synchronized wmf-config/CirrusSearch-common.php: [es5 upgrade] step 3: depool eqiad for writes (take 2) (3/3) (duration: 00m 41s) [13:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:31] gehel: fyi elastic@eqiad should be fully depooled (mw wise), I'll try to reenable the completion suggester now [13:31:52] dcausse: I'll wait a bit before upgrading, just in case :) [13:32:00] (03PS2) 10DCausse: [es5 upgrade] Enable completion suggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343835 [13:32:01] yes :) [13:34:22] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343835 (owner: 10DCausse) [13:34:54] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.586 second response time [13:35:34] PROBLEM - puppet last run on db1038 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:35:46] (03Merged) 10jenkins-bot: [es5 upgrade] Enable completion suggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343835 (owner: 10DCausse) [13:37:24] (03CR) 10jenkins-bot: [es5 upgrade] Enable completion suggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343835 (owner: 10DCausse) [13:39:39] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: [es5 upgrade] Enable completion suggester (duration: 00m 42s) [13:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:59] looks so far [13:40:01] good [13:42:11] waiting 2 more minutes before calling this done [13:44:13] !log eu SWAT done [13:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:03] 06Operations, 10media-storage: File not found after reupload - https://phabricator.wikimedia.org/T125140#3117752 (10Aklapper) 05Open>03declined Declining as per last comment. :( [13:49:12] (03PS3) 10Muehlenhoff: Harmomise group type for LDAP admin access [puppet] - 10https://gerrit.wikimedia.org/r/342008 (https://phabricator.wikimedia.org/T157131) [13:50:45] (03CR) 10Muehlenhoff: [C: 032] Harmomise group type for LDAP admin access [puppet] - 10https://gerrit.wikimedia.org/r/342008 (https://phabricator.wikimedia.org/T157131) (owner: 10Muehlenhoff) [13:51:56] moritzm: too late, but s/harmomise/harmonize/ :) [13:52:46] 06Operations, 07discovery-system: Create the nulloid service as fallback for the DNS discovery - https://phabricator.wikimedia.org/T160994#3117766 (10Volans) [13:53:51] en-US vs en-UK :D [13:57:11] (03PS1) 10Marostegui: db-codfw.php: Repool db2044, depool db2037 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343867 (https://phabricator.wikimedia.org/T73563) [13:57:39] (03Abandoned) 10Marostegui: Revert "db-codfw.php: Depool db2044" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343850 (owner: 10Marostegui) [13:59:01] (03CR) 10Marostegui: 
[C: 032] db-codfw.php: Repool db2044, depool db2037 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343867 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [14:00:20] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2044, depool db2037 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343867 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [14:00:28] (03CR) 10jenkins-bot: db-codfw.php: Repool db2044, depool db2037 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343867 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [14:01:55] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2044, depool db2037 T160415 - T73563 (duration: 00m 42s) [14:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:02] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [14:02:02] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563 [14:03:34] RECOVERY - puppet last run on db1038 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [14:03:39] (03PS1) 10Muehlenhoff: Revert "Harmomise group type for LDAP admin access" [puppet] - 10https://gerrit.wikimedia.org/r/343868 [14:04:24] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:04:38] (03PS1) 10Gehel: elasticsearch - upgrade eqiad to elasticsearch 5 [puppet] - 10https://gerrit.wikimedia.org/r/343869 (https://phabricator.wikimedia.org/T157479) [14:04:54] volans: I was pointing out the m/n, not the s/z :) [14:05:16] (03CR) 10Muehlenhoff: [C: 032] Revert "Harmomise group type for LDAP admin access" [puppet] - 10https://gerrit.wikimedia.org/r/343868 (owner: 10Muehlenhoff) [14:06:10] (03PS2) 10Gehel: elasticsearch - upgrade eqiad to elasticsearch 5 [puppet] - 10https://gerrit.wikimedia.org/r/343869 (https://phabricator.wikimedia.org/T157479) [14:06:32] (03CR) 10DCausse: [C: 031] elasticsearch - upgrade eqiad to elasticsearch 5 [puppet] - 10https://gerrit.wikimedia.org/r/343869 (https://phabricator.wikimedia.org/T157479) (owner: 10Gehel) [14:07:30] lol paravoid :) [14:07:34] !log upgrading elasticsearch eqiad to v5.x - T157479 [14:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:41] T157479: Put together a production migration plan for ES 2 -> ES 5 - https://phabricator.wikimedia.org/T157479 [14:11:44] hopefully, I've remembered to downtime all elasticsearch related inciga checks... [14:12:05] (03CR) 10Gehel: [C: 032] elasticsearch - upgrade eqiad to elasticsearch 5 [puppet] - 10https://gerrit.wikimedia.org/r/343869 (https://phabricator.wikimedia.org/T157479) (owner: 10Gehel) [14:13:24] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic1036.eqiad.wmnet because of too many down!: search_9200 - Could not depool server elastic1051.eqiad.wmnet because of too many down! 
[14:13:29] (03PS20) 10BBlack: [POC] DNS zones to puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/342887 [14:13:34] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic1028.eqiad.wmnet because of too many down!: search_9200 - Could not depool server elastic1051.eqiad.wmnet because of too many down! [14:13:46] pybal is me... [14:15:44] PROBLEM - HP RAID on db2037 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [14:16:14] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [14:18:24] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:19:34] (03PS3) 10Ema: varnish: swap around backend ttl cap and keep values [1/2] [puppet] - 10https://gerrit.wikimedia.org/r/343844 [14:19:36] (03PS3) 10Ema: varnish: swap around backend ttl cap and keep values [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/343845 (https://phabricator.wikimedia.org/T124954) [14:20:42] db2037? is it load? [14:21:27] marostegui, probably you are doing a schema change there? 
[14:21:41] Yep, I silenced it I believe [14:22:00] Ah, only the lag check [14:22:01] I see, that RAID check has a small timeout [14:22:18] compared to how much time it can take to execute under load [14:22:21] yeah, downtiming it now [14:22:32] ok with that [14:22:41] but maybe the check could change [14:22:46] let's hope the alter doesn't break a disk :) [14:23:14] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [14:23:46] I would need to ask some people about hardware and monitoring, and see how we could agree to do that [14:24:00] I think godog suffered from similar issues on swift machines [14:24:45] or maybe alter the check avoiding too much contention [14:25:04] RECOVERY - HP RAID on db2037 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor [14:25:13] jynus: it already has a 50s timeout [14:25:25] volans, I know [14:25:32] but apparently it is not enough [14:25:45] but the raid is not really bad [14:26:00] I understand the timeout is already very high [14:26:15] on the other hand, RAID issues normally do not require prompt response [14:26:27] so maybe a different call can be done [14:26:37] like checking one disk at a time or something [14:27:24] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:27:41] or executing the check on a cron, icinga only reading the results [14:27:57] so the timeout is actually lower, even if the check takes minutes [14:28:20] nrpe has 2 timeouts, the one on the single check and one in the nrpe.cfg config in each host [14:28:32] the latter is set to 60s as of now and applies to all checks [14:29:15] I do not think making the timeout larger is the way [14:31:36] bd808: can I get a hand with a semi-urgent wikitech patch? Looks like I missed SWAT for this morning. 
(and also I can never remember how to do the backport bits) [14:31:38] https://gerrit.wikimedia.org/r/#/c/343873 [14:31:43] another way could be with passive checks... but meh [14:32:21] volans, there is always a way- the only problems is which tradeoff to take [14:32:24] :-) [14:32:24] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [14:33:22] 06Operations, 07discovery-system: Create the failoid service as fallback for the DNS discovery - https://phabricator.wikimedia.org/T160994#3117918 (10Volans) [14:34:12] !log deleting old v2 indices from elastic1030: azbwiki_general_first, vewikimedia_content_1415331110, vewikimedia_general_1415331150 - T157479 [14:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:18] T157479: Put together a production migration plan for ES 2 -> ES 5 - https://phabricator.wikimedia.org/T157479 [14:36:32] (03PS1) 10Volans: Add entries for ganeti instances for failoid [dns] - 10https://gerrit.wikimedia.org/r/343877 (https://phabricator.wikimedia.org/T160994) [14:39:25] !log deleting old v2 indices from each elasticsearch server - T157479 [14:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:30] T157479: Put together a production migration plan for ES 2 -> ES 5 - https://phabricator.wikimedia.org/T157479 [14:44:03] !log elasticsearch eqiad, full cluster restart after cleanup of known old indices - T157479 [14:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:25] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [14:45:28] (03PS1) 10Marostegui: labsdb-replica.my.cnf: Set slave_type_conversions [puppet] - 10https://gerrit.wikimedia.org/r/343879 (https://phabricator.wikimedia.org/T73563) [14:45:34] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [14:47:24] RECOVERY - puppet last run 
on labvirt1009 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [14:47:42] andrewbogott: o/ still need help on that patch? [14:47:53] bd808: yes please [14:48:43] (03CR) 10Marostegui: "This looks good: https://puppet-compiler.wmflabs.org/5850/" [puppet] - 10https://gerrit.wikimedia.org/r/343879 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [14:49:08] jouncebot: now [14:49:08] No deployments scheduled for the next 1 hour(s) and 10 minute(s) [14:49:52] 06Operations, 10ORES, 10Revision-Scoring-As-A-Service-Backlog: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615#3117957 (10akosiaris) >>! In T159615#3116226, @mobrovac wrote: > How about replicating the precaching redis instance across DCs? Wou... [14:50:03] !log installing gnutls security updates on trusty (jessie already fixed) [14:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:13] andrewbogott: +2 to master. Looks like the whole world is still on 1.29.0-wmf.16 so we will only need one backport [14:50:37] the trick for that is just to use the "cherry pick" button in gerrit [14:51:12] andrewbogott: here's the cherry pick -- https://gerrit.wikimedia.org/r/#/c/343880/ [14:51:37] you make it sound so easy :) [14:51:50] I have practiced :) [14:52:15] so we just +2 that cherry pick and then everything merges on its own? [14:52:22] (and then I do a scap-pull on silver) [14:53:03] the +2 will bump the submodule. 
then we need to pull it down on tin [14:53:22] once it is on tin then I'll sync it to silver [14:54:14] there is a huge refresher page on all this at https://wikitech.wikimedia.org/wiki/How_to_deploy_code [14:54:48] (03CR) 10Marostegui: [C: 032] labsdb-replica.my.cnf: Set slave_type_conversions [puppet] - 10https://gerrit.wikimedia.org/r/343879 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [14:55:10] * bd808 waits patiently for gate-and-submit [14:55:47] (03CR) 10Alexandros Kosiaris: [C: 04-1] "mostly LGTM, inline comment" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/343877 (https://phabricator.wikimedia.org/T160994) (owner: 10Volans) [14:56:24] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [14:57:45] (03PS2) 10Volans: Add entries for ganeti instances for failoid [dns] - 10https://gerrit.wikimedia.org/r/343877 (https://phabricator.wikimedia.org/T160994) [14:57:53] (03CR) 10Volans: "done" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/343877 (https://phabricator.wikimedia.org/T160994) (owner: 10Volans) [14:58:56] !log elasticsearch upgrade on eqiad is completed - T157479 [14:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:04] T157479: Put together a production migration plan for ES 2 -> ES 5 - https://phabricator.wikimedia.org/T157479 [14:59:04] (03CR) 10Alexandros Kosiaris: [C: 031] Add entries for ganeti instances for failoid [dns] - 10https://gerrit.wikimedia.org/r/343877 (https://phabricator.wikimedia.org/T160994) (owner: 10Volans) [15:00:14] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [15:00:35] (03PS1) 10Giuseppe Lavagetto: confd: fix check command for template errors [puppet] - 10https://gerrit.wikimedia.org/r/343882 [15:00:55] <_joe_> can someone look at mw fatals? 
[15:00:59] !log bd808@tin Synchronized php-1.29.0-wmf.16/extensions/OpenStackManager/special/SpecialNovaInstance.php: SpecialNovaInstance: Remove some totally useless domain code. (T160995) (duration: 00m 43s) [15:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:06] T160995: Wikitech 'Requested domain is invalid' - https://phabricator.wikimedia.org/T160995 [15:01:16] andrewbogott: ^ should be live [15:01:28] <_joe_> bd808: we're having mediawiki fatals alerting right now [15:01:47] bd808: yep, fixed! thank you [15:01:57] Pretty sure I didn't break MW in the process... [15:02:22] _joe_: going to check logstash now. fatalmonitor script doesn't scary [15:03:09] _joe_: looking I see a bunch of errors related to cirrus [15:03:27] <_joe_> dcausse: that might be related to what you are doing? [15:03:44] there were some big lag warning spikes [15:03:53] yes related to the previous swat, not sure why it starts to complain just now :/ [15:04:12] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/5851/einsteinium.wikimedia.org/ says it's ok, merging" [puppet] - 10https://gerrit.wikimedia.org/r/343619 (owner: 10Alexandros Kosiaris) [15:04:17] (03PS3) 10Alexandros Kosiaris: netops::check: Add a parents attribute [puppet] - 10https://gerrit.wikimedia.org/r/343619 [15:04:42] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] netops::check: Add a parents attribute [puppet] - 10https://gerrit.wikimedia.org/r/343619 (owner: 10Alexandros Kosiaris) [15:04:51] (03PS2) 10Giuseppe Lavagetto: confd: fix check command for template errors [puppet] - 10https://gerrit.wikimedia.org/r/343882 [15:06:33] (03CR) 10Giuseppe Lavagetto: [C: 032] confd: fix check command for template errors [puppet] - 10https://gerrit.wikimedia.org/r/343882 (owner: 10Giuseppe Lavagetto) [15:08:11] _joe_: All I really see is 3 short'ish spikes of replag errors [15:08:22] <_joe_> bd808: yeah I agree [15:08:35] ty bd808 [15:09:03] andrewbogott: yw. 
you have to work a lot harder to merge my puppet patches :) [15:09:21] hm my problem generates around ~100 errors per minute [15:10:14] PROBLEM - Confd template for /var/lib/gdnsd/discovery-appservers-rw.state on radon is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-appservers-rw.state is broken [15:10:28] (03CR) 10Ottomata: Blacklisting ImageMetrics schemas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/343809 (https://phabricator.wikimedia.org/T141407) (owner: 10Nuria) [15:11:16] <_joe_> ok this is much better :) [15:12:07] (03PS1) 10Muehlenhoff: Harmonise group type for LDAP admin access [puppet] - 10https://gerrit.wikimedia.org/r/343884 (https://phabricator.wikimedia.org/T157131) [15:13:50] (03CR) 10Muehlenhoff: [C: 032] Harmonise group type for LDAP admin access [puppet] - 10https://gerrit.wikimedia.org/r/343884 (https://phabricator.wikimedia.org/T157131) (owner: 10Muehlenhoff) [15:14:38] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 334 MB (3% inode=74%) [15:14:44] dcausse: can I help on those cirrus errors? [15:15:12] gehel: not sure, will investigate I think it's related to new elastic5 code [15:17:33] (03CR) 10Volans: [C: 032] Add entries for ganeti instances for failoid [dns] - 10https://gerrit.wikimedia.org/r/343877 (https://phabricator.wikimedia.org/T160994) (owner: 10Volans) [15:18:39] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3117998 (10Gehel) 05Resolved>03Open I resolved this by mistake, re-opening. 
[15:19:09] (03CR) 10Chad: Fix some Debian lintian warnnings for the gerrit package (031 comment) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/343297 (owner: 10Paladox) [15:19:18] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [15:23:02] (03PS2) 10Giuseppe Lavagetto: realm: remove parsoid_site, switch to discovery. [puppet] - 10https://gerrit.wikimedia.org/r/340993 [15:27:02] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3118050 (10Gehel) After ~25' of stress + bonnie elastic2020 crashed again. That seem to indicate a systematic issue. Test can be seen on [[ htt... [15:27:28] (03PS3) 10Nuria: Blacklisting ImageMetrics schemas [puppet] - 10https://gerrit.wikimedia.org/r/343809 (https://phabricator.wikimedia.org/T141407) [15:27:42] !log removed "Directory Managers" group from LDAP (Bug T157131) [15:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:47] T157131: Harmonise "Directory Managers" group - https://phabricator.wikimedia.org/T157131 [15:28:07] (03PS1) 10Subramanya Sastry: Fix nginx config for serving individual visualdiff pngs [puppet] - 10https://gerrit.wikimedia.org/r/343886 [15:28:09] (03PS1) 10Subramanya Sastry: Enable devAPI in Parsoid to enable /_rt/ in testreduce results UI [puppet] - 10https://gerrit.wikimedia.org/r/343887 [15:29:06] (03PS4) 10Ottomata: Blacklisting ImageMetrics schemas [puppet] - 10https://gerrit.wikimedia.org/r/343809 (https://phabricator.wikimedia.org/T141407) (owner: 10Nuria) [15:29:09] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3118056 (10Gehel) elastic2020 is banned from elasticsearch cluster and has a 1 month downtime in icinga. Let's figure out what we can do with i... 
[15:29:13] (03CR) 10Ottomata: [V: 032 C: 032] Blacklisting ImageMetrics schemas [puppet] - 10https://gerrit.wikimedia.org/r/343809 (https://phabricator.wikimedia.org/T141407) (owner: 10Nuria) [15:29:18] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [15:29:48] 06Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#3118063 (10MoritzMuehlenhoff) [15:29:51] 06Operations, 13Patch-For-Review: Harmonise "Directory Managers" group - https://phabricator.wikimedia.org/T157131#3118060 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff LDAP ACLs have been converted to use "cn=ldap_ops" (which is a standard group) and "cn=Directory Managers" has eventuall... [15:29:58] PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:32:39] (03PS1) 10Subramanya Sastry: Migrate away from parsoid-tests.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/343888 [15:34:37] (03CR) 10Chad: "This isn't urgent at all. If it's messy we can do one of three things:" [puppet] - 10https://gerrit.wikimedia.org/r/343798 (owner: 10Chad) [15:36:43] (03PS1) 10Alexandros Kosiaris: sync_icinga_state: stop/start the service [puppet] - 10https://gerrit.wikimedia.org/r/343889 [15:37:30] (03PS1) 10Volans: Add entries for failoid VMs [puppet] - 10https://gerrit.wikimedia.org/r/343890 (https://phabricator.wikimedia.org/T160994) [15:38:11] volans: Do we have to add oid to everything? :P [15:38:57] (03PS1) 10Marostegui: db-eqiad.php: Repool db1092 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343892 (https://phabricator.wikimedia.org/T137191) [15:38:57] Reedy: I was told it's a tradition :) [15:38:58] PROBLEM - Host es2015 is DOWN: PING CRITICAL - Packet loss = 100% [15:39:26] ^ that came back from downtime [15:39:29] I will silence it [15:39:32] papaul: you around? 
[15:39:36] yeah I guessed [15:40:33] marostegui: yes [15:41:01] papaul: in the end was the mainboard for es2015 replaced? https://phabricator.wikimedia.org/T160242 [15:41:16] marostegui: he is working on it now [15:41:20] ah nice! [15:41:22] :) [15:41:36] papaul: once he is done, feel free to power on the server! [15:41:46] marostegui: will do [15:41:51] papaul: cheers! [15:44:37] 06Operations, 07LDAP: Cross-check disabled accounts from corp LDAP against data.yaml - https://phabricator.wikimedia.org/T161003#3118070 (10MoritzMuehlenhoff) [15:46:59] 06Operations, 06Office-IT, 07LDAP: Remove disabled users from internal mailing lists - https://phabricator.wikimedia.org/T161004#3118086 (10MoritzMuehlenhoff) [15:48:48] (03CR) 10Marostegui: [C: 04-2] "Wait till the lag is gone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343892 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [15:49:33] (03CR) 10Mobrovac: [C: 031] Enable devAPI in Parsoid to enable /_rt/ in testreduce results UI [puppet] - 10https://gerrit.wikimedia.org/r/343887 (owner: 10Subramanya Sastry) [15:50:41] (03PS1) 10Madhuvishy: tools: Update maintain-dbusers to create labsdb accounts for tools users [puppet] - 10https://gerrit.wikimedia.org/r/343894 (https://phabricator.wikimedia.org/T158420) [15:50:52] (03CR) 10Mobrovac: Fix nginx config for serving individual visualdiff pngs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/343886 (owner: 10Subramanya Sastry) [15:52:18] (03CR) 10jerkins-bot: [V: 04-1] tools: Update maintain-dbusers to create labsdb accounts for tools users [puppet] - 10https://gerrit.wikimedia.org/r/343894 (https://phabricator.wikimedia.org/T158420) (owner: 10Madhuvishy) [15:56:40] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic2020.codfw.wmnet [15:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:49] (03PS2) 10Madhuvishy: tools: Update maintain-dbusers to create labsdb accounts for 
tools users [puppet] - 10https://gerrit.wikimedia.org/r/343894 (https://phabricator.wikimedia.org/T158420) [15:57:54] (03CR) 10Dzahn: "using them is blocked by https://phabricator.wikimedia.org/T133548" [puppet] - 10https://gerrit.wikimedia.org/r/285084 (https://phabricator.wikimedia.org/T105981) (owner: 10Alex Monk) [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170321T1600). [16:00:04] gwicke, eevans, and RainbowSprinkles: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:49] * urandom is available o/ [16:01:06] (03PS2) 10Subramanya Sastry: Fix nginx config for serving individual visualdiff pngs [puppet] - 10https://gerrit.wikimedia.org/r/343886 [16:01:09] (03PS2) 10Subramanya Sastry: Enable devAPI in Parsoid to enable /_rt/ in testreduce results UI [puppet] - 10https://gerrit.wikimedia.org/r/343887 [16:01:11] (03PS2) 10Subramanya Sastry: Migrate away from parsoid-tests.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/343888 [16:01:21] I am here :) [16:04:10] looking [16:04:32] (03PS5) 10Eevans: Enable encrypted client connections in RESTBase production [puppet] - 10https://gerrit.wikimedia.org/r/342903 (https://phabricator.wikimedia.org/T111113) [16:06:18] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [16:06:37] urandom: going with yours now since you just rebased [16:06:44] (03CR) 10Filippo Giunchedi: [C: 032] Enable encrypted client connections in RESTBase production [puppet] - 10https://gerrit.wikimedia.org/r/342903 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [16:06:49] godog: cool; thanks! [16:09:25] RainbowSprinkles: I just noticed https://gerrit.wikimedia.org/r/#/c/342788 needs the parent to be created too, e.g. 
/srv/deployment/mediawiki or the symlink will fail :( [16:09:30] urandom: np! merged [16:09:50] godog: Ah, lemme amend, good catch [16:10:18] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:11:18] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [16:11:19] !log T111113: Enabling RESTBase client encryption on restbase2001.codfw.wmnet (canary) [16:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:25] T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113 [16:11:36] gwicke: here? I was looking at https://gerrit.wikimedia.org/r/#/c/341833 [16:12:06] (03CR) 10Dzahn: "If you are personally interested in these non-working domains, feel free to have that general discussion about buying domains, certs and b" [puppet] - 10https://gerrit.wikimedia.org/r/285084 (https://phabricator.wikimedia.org/T105981) (owner: 10Alex Monk) [16:13:29] (03CR) 10Paladox: "> @Paladox, could you add me when the 2 upstream things have merged?" 
[puppet] - 10https://gerrit.wikimedia.org/r/343736 (owner: 10Paladox) [16:14:12] (03CR) 10Paladox: "> We can wait until those land (mostly to make sure it doesn't change" [puppet] - 10https://gerrit.wikimedia.org/r/343736 (owner: 10Paladox) [16:14:42] (03PS3) 10Chad: Scap3: Prep MediaWiki to be available from /srv/deployment [puppet] - 10https://gerrit.wikimedia.org/r/342788 [16:15:41] (03PS4) 10Chad: Scap3: Prep MediaWiki to be available from /srv/deployment [puppet] - 10https://gerrit.wikimedia.org/r/342788 [16:16:15] godog: PS4 should do it [16:16:18] RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.71 ms [16:16:39] RainbowSprinkles: nice, I'm running the compiler [16:17:41] !log T111113: Enabling RESTBase client encryption on (remaining) codfw nodes [16:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:47] T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113 [16:18:41] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] "PCC https://puppet-compiler.wmflabs.org/5852/" [puppet] - 10https://gerrit.wikimedia.org/r/342788 (owner: 10Chad) [16:19:58] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.234 second response time [16:20:19] RainbowSprinkles: {{done}} [16:20:38] Yay :) [16:20:58] It's gonna go out to all mw* nodes and we don't need it urgently, so can just wait for puppet to do it itself [16:21:02] No need for a manual run anywhere [16:22:28] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:22:29] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:22:39] PROBLEM - puppet last run on mw1198 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:22:42] Fuck. [16:22:45] What happened? [16:22:48] PROBLEM - puppet last run on mw2106 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:22:48] PROBLEM - puppet last run on mw2237 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:23:16] RainbowSprinkles: Cannot create /srv/deployment/mediawiki; parent directory /srv/deployment does not exist [16:23:19] PROBLEM - puppet last run on mw1178 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:23:24] Gahhhhh wtf.... [16:23:28] I'll revert. [16:23:28] PROBLEM - puppet last run on mw1233 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:23:29] PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:23:29] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:23:38] PROBLEM - puppet last run on mw2231 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:23:39] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:23:39] PROBLEM - puppet last run on mw2186 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:23:39] PROBLEM - puppet last run on mw2202 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:23:39] PROBLEM - puppet last run on mw2144 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:23:39] PROBLEM - puppet last run on mw2255 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:23:48] (03PS1) 10Chad: Revert "Scap3: Prep MediaWiki to be available from /srv/deployment" [puppet] - 10https://gerrit.wikimedia.org/r/343909 [16:23:48] PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:23:56] Revert commit ^^^^ [16:23:59] godog [16:24:17] sigh [16:24:18] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:24:24] stupid puppet :D [16:24:28] PROBLEM - puppet last run on mw1270 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:24:29] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:24:29] PROBLEM - puppet last run on mw2219 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:24:34] it can't even create a parent dir :} [16:24:38] PROBLEM - puppet last run on mw2091 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:24:38] PROBLEM - puppet last run on mw2258 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:24:38] PROBLEM - puppet last run on mw2209 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki] [16:24:50] sorry about the spam, stopping ircecho [16:24:58] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.689 second response time [16:26:04] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Revert "Scap3: Prep MediaWiki to be available from /srv/deployment" [puppet] - 10https://gerrit.wikimedia.org/r/343909 (owner: 10Chad) [16:26:37] Sorry about that. Should've caught it :( [16:26:59] np, I assumed mediawiki::scap was only on deployment server heh [16:27:44] You can apparently use an array in file: [16:27:47] hashar: https://projects.puppetlabs.com/issues/6368 [16:27:49] https://projects.puppetlabs.com/issues/86 [16:27:53] " [16:27:54] File type should autorequire all parents [16:28:25] file {['/top','/top/second','/top/second/deep','/top/second/deep/deepest']: [16:28:26] ensure => directory [16:28:26] } [16:28:41] Yeah, but that doesn't work in this particular use case of mine. 
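
Editor's sketch, not part of the channel log: the failure above came from a file resource whose parent directory (/srv/deployment) did not exist, and the pasted Puppet snippet works around that by listing every ancestor directory explicitly. As a rough illustration only, the Python below shows how such a list could be derived from a single target path. The function name ancestor_dirs is hypothetical, the code is not taken from the puppet repo or from Puppet itself, and, as noted in the log, the array approach did not fit the use case being discussed.

```python
from pathlib import PurePosixPath
from typing import List


def ancestor_dirs(path: str) -> List[str]:
    """Return each directory from the top-level ancestor down to `path` itself.

    Hypothetical helper: it only manipulates path strings and never
    touches the filesystem.
    """
    target = PurePosixPath(path)
    if not target.is_absolute():
        raise ValueError("expected an absolute path, got %r" % path)
    # .parents is ordered nearest-first and ends at the root "/";
    # reverse it and drop the root so the list reads top-down.
    parents = [str(p) for p in list(target.parents)[::-1] if str(p) != "/"]
    return parents + [str(target)]


if __name__ == "__main__":
    for directory in ancestor_dirs("/srv/deployment/mediawiki"):
        print(directory)
```

Each entry in the returned list corresponds to one element of the array in the pasted file resource, ordered so that every parent appears before its child.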
[16:29:55] godog: I'll work on a new version of this...put up for next puppetswat I guess [16:29:58] * RainbowSprinkles sighs [16:30:30] mutante i found out that cc is an exclusive feature to NoteDB [16:30:31] RainbowSprinkles: hehe ok, feel free to add me beforehand [16:30:49] paladox: ok [16:30:51] Also i found out that they are going to support cc for users even if they don't have an account on gerrit [16:31:45] paladox: does "cc"/"reviewer" manifest as a difference in web ui ? [16:31:51] yes [16:31:54] paladox: or is all of this just about email [16:32:05] It has a different field [16:32:08] ok [16:32:17] so it actually shows reviewers as one field and cc as another [16:32:21] "reviewer light" :p [16:32:25] lol [16:32:32] +0.5 to that [16:33:31] though we won't be able to use it. They should really support ReviewDB with that feature. [16:34:02] NoteDB is really slow though. [16:34:23] 06Operations, 10Traffic: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#2696835 (10Gilles) On the tech-mgmt meeting you mentioned this was underway, is there another phab task for it? [16:35:42] https://gerrit.googlesource.com/homepage/+/md-pages/docs/Notedb.md " If you are a casual user of Gerrit looking for documentation, you've come to the wrong place." heh ¯\_(ツ)_/¯ [16:36:24] honest, at least. :) [16:36:26] Yep it's still beta [16:36:49] + once you've migrated to it there's no going back [16:37:02] i tried that once by mistake [16:37:25] i wanted the hashtag feature. [16:38:51] 06Operations, 10DBA, 05DC-Switchover-Prep-Q3-2016-17, 07Wikimedia-Multiple-active-datacenters: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007#3118247 (10Volans) [16:41:19] "see challenges section" [16:41:22] Challenges? 
[16:41:22] TODO [16:41:30] Best documentation [16:41:37] Thanks gerrit <3 [16:41:46] !log T111113: Rolling restart of RESTBase, codfw, complete [16:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:52] T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113 [16:42:19] RainbowSprinkles they promised to keep that doc up to date too [16:42:28] in one of the commit msg that's what they said [16:42:45] Seeeee, that's why I don't promise things! [16:42:55] It's just inviting me to break them / be a liar :p [16:42:59] Challenges: Keep a list of challenges [16:43:25] https://gerrit-review.googlesource.com/#/c/98070/ [16:43:29] "I hereby promise to try to keep this document up to date with the [16:43:29] current migration state." [16:45:08] ah here we go https://gerrit-review.googlesource.com/Documentation/dev-note-db.html [16:45:13] "It won't be standard until 3.0, and by then, this document surely will not exist in anything like its current form." [16:45:24] mutante ^^ [16:46:14] (03PS1) 10Volans: Failoid: add service to reject connections [puppet] - 10https://gerrit.wikimedia.org/r/343917 (https://phabricator.wikimedia.org/T160994) [16:46:16] 06Operations, 10Traffic: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3118280 (10BBlack) This is it. We're currently still testing/deploying the kernel that allows it to be enabled. After that we can do some testing/evaluation on BBR itself and report here.... [16:48:00] 06Operations, 10DBA, 05DC-Switchover-Prep-Q3-2016-17, 07Wikimedia-Multiple-active-datacenters: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007#3118288 (10jcrespo) This should not take much time, setting up next on the pipeline. 
[16:48:04] there will be a shower of recoveries shortly btw
[16:48:29] paladox: Yeah, 3.0 for gerrit is probably coming about when MW 2.0 is ;-)
[16:48:37] lol
[16:48:57] I wonder what mediawiki 2.0 and gerrit 3.0 will look like
[16:49:34] Probably exactly like they do now
[16:49:39] Version #s are dumb anyway
[16:49:45] depends, are you going to create polywiki skin?
[16:49:47] lol
[16:49:48] RECOVERY - puppet last run on mw2237 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[16:49:58] (PS1) DCausse: [es5 upgrade] step 4: repool eqiad for writes [mediawiki-config] - https://gerrit.wikimedia.org/r/343919 (https://phabricator.wikimedia.org/T157479)
[16:50:28] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[16:50:28] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[16:50:28] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[16:50:38] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[16:50:38] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[16:50:48] RECOVERY - puppet last run on mw2106 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[16:51:07] RainbowSprinkles mutante im adding a basic doc on polygerrit https://gerrit-review.googlesource.com/#/c/100292/ :)
[16:51:18] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[16:51:26] they did not explain the config it had. Well now polygerrit will have a doc.
[16:51:28] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [16:51:28] RECOVERY - puppet last run on mw1256 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [16:51:29] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [16:51:38] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:51:38] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [16:51:38] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:51:38] RECOVERY - puppet last run on mw2186 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [16:51:48] RECOVERY - puppet last run on mw2255 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [16:51:48] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:51:52] 06Operations, 10ops-codfw, 10hardware-requests: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3118313 (10Papaul) [16:52:07] !log T111113: Rolling restart of RESTBase, eqiad [16:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:12] T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113 [16:52:18] RECOVERY - puppet last run on mw1200 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:52:28] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:52:29] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [16:52:29] RECOVERY - puppet last run on mw1185 is OK: OK: Puppet is currently 
enabled, last run 31 seconds ago with 0 failures [16:52:29] RECOVERY - puppet last run on mw1270 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [16:52:29] RECOVERY - puppet last run on mw2219 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [16:52:38] RECOVERY - puppet last run on mw2091 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:52:38] RECOVERY - puppet last run on mw2258 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [16:52:38] RECOVERY - puppet last run on mw2209 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:52:58] PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:53:18] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [16:53:38] RECOVERY - puppet last run on mw2141 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [16:54:18] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:54:18] RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:54:18] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:54:18] RECOVERY - puppet last run on mw1174 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:54:28] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:54:28] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:54:29] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:54:29] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is 
currently enabled, last run 11 seconds ago with 0 failures [16:54:29] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:54:38] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [16:54:38] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:54:38] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [16:54:48] RECOVERY - puppet last run on mw2217 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:55:19] RECOVERY - puppet last run on mw1284 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:55:48] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [16:55:48] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:55:48] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [16:56:18] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [16:56:19] RECOVERY - puppet last run on mw1276 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:56:19] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:56:28] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:56:29] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [16:56:29] RECOVERY - puppet last run on mwdebug1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:56:38] 
RECOVERY - puppet last run on mw2254 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:56:38] RECOVERY - puppet last run on mw2095 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:56:38] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:56:48] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:56:48] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [16:57:15] (03CR) 10Dzahn: [C: 031] Add entries for failoid VMs [puppet] - 10https://gerrit.wikimedia.org/r/343890 (https://phabricator.wikimedia.org/T160994) (owner: 10Volans) [16:57:18] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:57:28] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:57:38] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [16:57:38] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:58:14] RainbowSprinkles i wonder if elasticsearch will improve index reliability [16:58:23] I think so. [16:58:26] I plan to switch to it [16:58:27] :) [16:58:40] (03CR) 10Alexandros Kosiaris: [C: 031] Add entries for failoid VMs [puppet] - 10https://gerrit.wikimedia.org/r/343890 (https://phabricator.wikimedia.org/T160994) (owner: 10Volans) [16:59:09] (03PS2) 10Volans: Add entries for failoid VMs [puppet] - 10https://gerrit.wikimedia.org/r/343890 (https://phabricator.wikimedia.org/T160994) [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. 
Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170321T1700). [17:00:20] 06Operations, 06Performance-Team, 10Traffic: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3118331 (10Gilles) [17:00:21] no parsoid deploy today [17:01:46] (03CR) 10Volans: [C: 032] Add entries for failoid VMs [puppet] - 10https://gerrit.wikimedia.org/r/343890 (https://phabricator.wikimedia.org/T160994) (owner: 10Volans) [17:02:22] !log T111113: Rolling restart of RESTBase, eqiad, complete [17:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:28] T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113 [17:05:59] (03CR) 10Marostegui: [C: 032] "server caught up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343892 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [17:08:42] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1092 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343892 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [17:08:52] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1092 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343892 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [17:09:38] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1092 with low weight - T137191 (duration: 00m 42s) [17:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:44] T137191: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191 [17:13:32] !log starting branch cut for 1.29.0-wmf.17 [17:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:03] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343923 
(https://phabricator.wikimedia.org/T137191) [17:20:18] PROBLEM - puppet last run on es1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:22:16] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343923 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [17:23:46] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343923 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [17:23:58] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343923 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [17:24:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1092 weight - T137191 (duration: 00m 45s) [17:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:52] T137191: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191 [17:25:37] (03PS1) 10Giuseppe Lavagetto: restbase: use the dns discovery host for citoid [puppet] - 10https://gerrit.wikimedia.org/r/343926 (https://phabricator.wikimedia.org/T156100) [17:25:42] <_joe_> mobrovac: ^^ [17:26:08] marostegui jynus: We want to do this schema change during the switchover: https://phabricator.wikimedia.org/T159718 [17:26:35] Does it sound good to you? Can I proceed to making phab cards, etc. [17:27:11] Amir1: Hello, what do you exactly mean by: during the switchover? [17:27:28] Amir1, no indexes? [17:27:40] when the traffic is going to codfw so we can alter eqiad tables easily [17:28:14] Amir1: gotcha. But we can start with codfw once you've filled the phabricator task with the last schema changes agreed on :) [17:28:21] jynus: Let me check. 
I think there should be [17:28:28] there may be a discrepancy [17:28:33] on the gerrit patch [17:28:41] the indexes are commented on one file [17:28:49] but added on another [17:28:56] can you check that [17:29:02] sure [17:29:20] and the second request would be to use the template for blocked on schema change? [17:29:39] just add the tag, and edit the summary with the info asked on: [17:30:06] https://wikitech.wikimedia.org/wiki/Schema_changes#Workflow_of_a_schema_change [17:30:08] yeah, I was going to ask for the final gerrit patch, so we do not have to go thru a few of them and potentially make a mistake :) [17:30:15] yeah, this is my third or forth time asking for a schema change :) [17:30:37] Amir1, I will give it a look [17:30:59] because it may not need to wait for the dc swithcover [17:31:11] I don't think it needs to wait for it either [17:31:12] Okay, thanks :) [17:31:13] but it may, I think it is a high-traffic table [17:31:42] Just one of those one-by-one-depool-repool alters :) [17:31:47] marostegui, it may- almost all edits on wikidata modify that and almost all uncached reads may read it [17:31:52] sure [17:32:20] I am not sure about the master, though [17:32:26] we will have to check traffic there [17:32:51] The master I would wait for the swithcover [17:32:55] do codfw while it is standby [17:33:01] and then eqiad once it is stand by [17:33:44] sounds like a plan to me [17:33:57] (03CR) 10Mobrovac: [C: 031] restbase: use the dns discovery host for citoid [puppet] - 10https://gerrit.wikimedia.org/r/343926 (https://phabricator.wikimedia.org/T156100) (owner: 10Giuseppe Lavagetto) [17:36:11] (03PS1) 10Marostegui: db-eqiad.php: Increase weight db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343927 (https://phabricator.wikimedia.org/T137191) [17:36:55] <_joe_> mobrovac: I'm merging it [17:36:58] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [17:36:58] (03CR) 
10Giuseppe Lavagetto: [C: 032] restbase: use the dns discovery host for citoid [puppet] - 10https://gerrit.wikimedia.org/r/343926 (https://phabricator.wikimedia.org/T156100) (owner: 10Giuseppe Lavagetto) [17:37:04] (03PS2) 10Giuseppe Lavagetto: restbase: use the dns discovery host for citoid [puppet] - 10https://gerrit.wikimedia.org/r/343926 (https://phabricator.wikimedia.org/T156100) [17:37:18] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] restbase: use the dns discovery host for citoid [puppet] - 10https://gerrit.wikimedia.org/r/343926 (https://phabricator.wikimedia.org/T156100) (owner: 10Giuseppe Lavagetto) [17:37:23] kk _joe_, when you run puppet I will restart a couple of rb nodes [17:37:38] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash: Import new kibana and logstash .debs to wikimedia experimental repository - https://phabricator.wikimedia.org/T160597#3118436 (10EBernhardson) a:05EBernhardson>03None [17:37:48] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:39:58] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [17:40:10] ^ I got it [17:41:21] thcipriani, marostegui, godog: I see you on tin, I'd like to deploy https://gerrit.wikimedia.org/r/#/c/343919/ any objections? [17:41:38] RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.12 ms [17:41:59] dcausse: if you give me 1 minute so I can deploy this: https://gerrit.wikimedia.org/r/#/c/343927/ ? 
[17:42:06] marostegui: sure [17:42:09] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343927 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [17:42:15] will take no time :) [17:42:26] <_joe_> mobrovac: done [17:42:36] marostegui: no rush :) [17:42:58] <_joe_> mobrovac: I could restart rb on the test cluster [17:43:08] <_joe_> and see if the service checker screams somehow [17:43:31] yes, please _joe_ [17:43:37] dcausse: no objections, I'm not deploying just now. [17:43:37] <_joe_> ok I'll do that [17:43:46] thcipriani: thanks [17:43:56] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3118453 (10Papaul) a:03Marostegui Main board and CPU2 replacement complete. System back up online. [17:44:02] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343927 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [17:44:13] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343927 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [17:44:22] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3118461 (10Marostegui) Thanks Papaul! 
I will take it from here [17:44:51] dcausse: yup no objections [17:44:57] godog: thanks :) [17:45:03] just hanging out on tin, no deploys :) [17:45:09] ok :) [17:45:11] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1092 weight - T137191 (duration: 00m 42s) [17:45:12] dcausse: I am done :-) [17:45:16] marostegui: thanks [17:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:17] T137191: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191 [17:45:33] ok starting to deploy [17:46:00] (03CR) 10DCausse: [C: 032] [es5 upgrade] step 4: repool eqiad for writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343919 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [17:48:57] (03Merged) 10jenkins-bot: [es5 upgrade] step 4: repool eqiad for writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343919 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [17:49:04] (03CR) 10jenkins-bot: [es5 upgrade] step 4: repool eqiad for writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343919 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [17:49:18] RECOVERY - puppet last run on es1016 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:50:18] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:53:21] !log dcausse@tin Synchronized wmf-config/CommonSettings.php: [es5 upgrade] step 4: repool eqiad for writes (1/3) (duration: 00m 42s) [17:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:38] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: [es5 upgrade] step 4: repool eqiad for writes (2/3) (duration: 00m 42s) [17:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:58] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.205 second response time [17:55:41] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3118507 (10Marostegui) @Papaul can you check if the network cable is plugged? The system doesn't have network. I ran: ``` root@es2015:~# rm -fr /etc/udev/rules.d/70-persistent-net.rules ``` As it is an... 
[17:56:03] !log dcausse@tin Synchronized wmf-config/CirrusSearch-common.php: [es5 upgrade] step 4: repool eqiad for writes (3/3) (duration: 00m 42s) [17:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:26] ok elastic@eqiad repooled for mediawiki [17:57:59] (03PS1) 1020after4: Phabricator: use elasticsearch 5 in codfw and eqiad [puppet] - 10https://gerrit.wikimedia.org/r/343936 [17:58:10] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: use elasticsearch 5 in codfw and eqiad [puppet] - 10https://gerrit.wikimedia.org/r/343936 (owner: 1020after4) [17:58:12] (03PS2) 1020after4: Phabricator: use elasticsearch 5 in codfw and eqiad [puppet] - 10https://gerrit.wikimedia.org/r/343936 [17:59:53] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.619 second response time [18:01:52] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3118566 (10leila) @Ottomata do you know if databases in staging in analytics-slave will be copied to some other place if we're decommissioning the suggested machines? I'm aski... [18:02:11] !log refreshing phabricator's elasticsearch index in eqiad [18:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:24] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3118585 (10Ottomata) @leila, we can dump and copy to analytics-store, as long as there aren't any database.table name collisions. 
[18:06:43] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [18:07:48] (03PS1) 10Jcrespo: mariadb: Increase db1094 weight after initial pool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343937 (https://phabricator.wikimedia.org/T160832) [18:08:30] jynus: can you merge a change I am about to push? to increase db1092 weight too? [18:08:38] sure [18:08:42] which one? [18:08:47] give me a sec [18:08:48] (03CR) 10EBernhardson: "should this be merged now?" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/341826 (owner: 10DCausse) [18:08:52] Or do you want me to do it? [18:08:58] let me do it [18:08:58] no no I have it almost ready [18:09:02] ah, ok [18:09:19] (03CR) 10Jcrespo: [C: 032] mariadb: Increase db1094 weight after initial pool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343937 (https://phabricator.wikimedia.org/T160832) (owner: 10Jcrespo) [18:09:43] (03PS1) 10Marostegui: db-eqiad.php: Increase db1092 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343938 (https://phabricator.wikimedia.org/T137191) [18:09:45] jynus ^ [18:10:11] s7, is api expected to have 2 large servers? [18:10:17] or is there something missing there? [18:10:26] do you remember? [18:10:43] (03Merged) 10jenkins-bot: mariadb: Increase db1094 weight after initial pool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343937 (https://phabricator.wikimedia.org/T160832) (owner: 10Jcrespo) [18:10:52] (03CR) 10jenkins-bot: mariadb: Increase db1094 weight after initial pool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343937 (https://phabricator.wikimedia.org/T160832) (owner: 10Jcrespo) [18:10:59] mmmm [18:11:33] I cannot remember, maybe it is a left over from previous pool/repool stuff? [18:12:47] https://gerrit.wikimedia.org/r/#/c/311680/3/wmf-config/db-eqiad.php [18:13:48] andre__, are you here? 
[18:14:04] (03CR) 10Jcrespo: [C: 032] db-eqiad.php: Increase db1092 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343938 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [18:14:10] (03PS2) 10Jcrespo: db-eqiad.php: Increase db1092 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343938 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [18:14:35] ah, so it used to have only 1 api [18:14:44] yes, looks so [18:14:45] ok, if the weight is low on one of the servers [18:14:53] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.217 second response time [18:15:09] which it does [18:15:14] 100:1 [18:15:23] PROBLEM - puppet last run on argon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:15:28] so it technically only has 1 api, but it fails over [18:15:41] I think I did that [18:17:05] (03CR) 10jenkins-bot: db-eqiad.php: Increase db1092 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343938 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [18:18:45] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Increase weight of db1092 and db1094 (duration: 00m 42s) [18:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:14] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [18:26:33] (03CR) 10DCausse: [C: 031] "Yes I think so, gehel what do you think? did you have to make other adjustments?" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/341826 (owner: 10DCausse) [18:28:15] (03CR) 10Gehel: "We also have logstash using those plugins. I'd prefer not to merge before we have a good view on where we are going with logstash." 
[software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/341826 (owner: 10DCausse) [18:30:21] PROBLEM - puppet last run on roentgenium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): File_line[login.defs-SYS_GID_MAX],File_line[login.defs-SYS_UID_MAX] [18:31:26] I'm on it [18:32:37] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3118697 (10leila) sounds good, @Ottomata. [18:34:21] RECOVERY - puppet last run on roentgenium is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:35:45] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: LDF endpoint ordering is not stable between servers when paging - https://phabricator.wikimedia.org/T159574#3118713 (10Gehel) At this point, the only workable option is the "single LDF server" (appart from abandoning LDF completely). So let's... [18:37:01] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [18:37:29] 06Operations, 13Patch-For-Review, 07discovery-system: Create the failoid service as fallback for the DNS discovery - https://phabricator.wikimedia.org/T160994#3118722 (10Volans) [18:39:51] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.362 second response time [18:40:01] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [18:43:21] RECOVERY - puppet last run on argon is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:46:01] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [18:51:01] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 
70.00% of data above the critical threshold [50.0] [18:52:36] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3118791 (10jcrespo) es2015.codfw.wmnet needs a mysql_upgrade run before restarting replication. BTW, I fixed some things on the new package: now you can run mysql_upgrade correctly on the path. [18:55:22] chasemp PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack i presume that isen't good? [18:56:08] paladox: we know about it, thanks [18:56:14] oh ok [18:57:47] (03PS3) 10Dzahn: testreduce: Enable devAPI in Parsoid to enable /_rt/ in results UI [puppet] - 10https://gerrit.wikimedia.org/r/343887 (owner: 10Subramanya Sastry) [18:58:10] (03CR) 10Dzahn: [C: 032] testreduce: Enable devAPI in Parsoid to enable /_rt/ in results UI [puppet] - 10https://gerrit.wikimedia.org/r/343887 (owner: 10Subramanya Sastry) [18:59:39] (03PS3) 10Dzahn: parsoid-testing: Fix nginx config for serving individual visualdiff pngs [puppet] - 10https://gerrit.wikimedia.org/r/343886 (owner: 10Subramanya Sastry) [18:59:52] (03CR) 10Dzahn: [C: 032] parsoid-testing: Fix nginx config for serving individual visualdiff pngs [puppet] - 10https://gerrit.wikimedia.org/r/343886 (owner: 10Subramanya Sastry) [19:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170321T1900). 
[19:00:18] * thcipriani does [19:01:05] (03PS3) 10Dzahn: testreduce: Migrate away from parsoid-tests.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/343888 (owner: 10Subramanya Sastry) [19:01:11] (03PS4) 10Dzahn: testreduce: Migrate away from parsoid-tests.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/343888 (owner: 10Subramanya Sastry) [19:01:20] (03CR) 10Dzahn: [C: 032] testreduce: Migrate away from parsoid-tests.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/343888 (owner: 10Subramanya Sastry) [19:03:07] (03PS4) 10Rush: nova: nova-fullstack keep instance count for debugging [puppet] - 10https://gerrit.wikimedia.org/r/343636 [19:03:42] !log thcipriani@tin Started scap: testwiki to php-1.29.0-wmf.17 and rebuild l10n cache [19:03:45] !do_whatever_is_needed_to_get_that_whole_depedency_chain_merged_rebase_it_all_whichever_is_first_dont_care_thx [19:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:33] * mutante is talking to jenkins-bot, sry [19:04:37] (03CR) 10Andrew Bogott: [C: 031] nova: nova-fullstack keep instance count for debugging [puppet] - 10https://gerrit.wikimedia.org/r/343636 (owner: 10Rush) [19:05:20] (03PS4) 10Dzahn: parsoid-testing: Fix nginx config for serving individual visualdiff pngs [puppet] - 10https://gerrit.wikimedia.org/r/343886 (owner: 10Subramanya Sastry) [19:06:12] subbu: welcome back from lunch, i'm about to get that stuff merged [19:06:13] (03CR) 10Volans: "Puppet compiler results available here: https://puppet-compiler.wmflabs.org/5854/" [puppet] - 10https://gerrit.wikimedia.org/r/343917 (https://phabricator.wikimedia.org/T160994) (owner: 10Volans) [19:07:02] was just fighting with the right order/rebase [19:08:10] thanks [19:08:18] nginx config change coming up on ruthenium first [19:08:29] (03PS4) 10Dzahn: testreduce: Enable devAPI in Parsoid to enable /_rt/ in results UI [puppet] - 10https://gerrit.wikimedia.org/r/343887 (owner: 10Subramanya Sastry) [19:09:07] 
(03CR) 10Dzahn: [V: 032 C: 032] testreduce: Enable devAPI in Parsoid to enable /_rt/ in results UI [puppet] - 10https://gerrit.wikimedia.org/r/343887 (owner: 10Subramanya Sastry) [19:09:51] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.277 second response time [19:10:09] !log ruthenium - dev API enabled in parsoid config for parsoid rt tests [19:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:29] (03PS5) 10Dzahn: testreduce: Migrate away from parsoid-tests.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/343888 (owner: 10Subramanya Sastry) [19:13:01] RECOVERY - Host es2015 is UP: PING OK - Packet loss = 0%, RTA = 36.07 ms [19:13:27] subbu: done, all your pending puppet changes applied on ruthenium now. no issues from puppet or service restarts [19:13:34] thanks! [19:13:40] yw [19:14:05] papaul, did you connect the ethernet?^ [19:14:20] i think "parsoid-tests" appears in another place, let me check [19:14:29] (besides DNS of course) [19:14:51] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.367 second response time [19:15:44] ^ madhuvishy just fyi what I mean, odd [19:17:03] (03PS5) 10Rush: nova: nova-fullstack keep instance count for debugging [puppet] - 10https://gerrit.wikimedia.org/r/343636 [19:17:16] (03CR) 10Rush: [V: 032 C: 032] nova: nova-fullstack keep instance count for debugging [puppet] - 10https://gerrit.wikimedia.org/r/343636 (owner: 10Rush) [19:17:43] subbu: we have settings in hiera, "logging_name" and "statsd_prefix" are set to "parsoid-tests". 
wondering if it stays one thing for logging [19:18:48] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3118898 (10jcrespo) I have restarted it myself- its ping returned- I assume Papaul did something? [19:18:57] ah .. that is just a tag for kibana and grafana .. if we change that, we'll have to mess with kibana dashboard. so, let us wait on that for now. [19:19:08] *nod*, ok, yes [19:19:27] 06Operations, 10Gerrit, 06Labs, 06Release-Engineering-Team, 07LDAP: Remove user gerrit2 from ldap - https://phabricator.wikimedia.org/T160122#3118904 (10Paladox) Can this be closed as resolved please? [19:19:54] 06Operations, 10Gerrit, 06Labs, 06Release-Engineering-Team, 07LDAP: Remove user gerrit2 from ldap - https://phabricator.wikimedia.org/T160122#3118905 (10demon) Haven't had time. [19:21:01] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [19:22:03] !log clean out admin-monitoring for nova-fullstack T160908 [19:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:09] T160908: Instance creation fails before first puppet run around 1% of the time - https://phabricator.wikimedia.org/T160908 [19:22:32] 06Operations, 06Labs: Instance creation fails before first puppet run around 1% of the time - https://phabricator.wikimedia.org/T160908#3118922 (10chasemp) https://gerrit.wikimedia.org/r/#/c/343636/ [19:23:38] (03PS1) 10Dzahn: misc-varnish/parsoid-tests: remove parsoid-tests backend [puppet] - 10https://gerrit.wikimedia.org/r/343948 [19:24:32] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3118936 (10Marostegui) Papaul just replugged the cable and it works now: ``` root@es2015:~# mii-tool eth0 eth0: negotiated, link ok ``` Thanks @Papaul! 
[19:25:30] 06Operations, 10Gerrit, 06Labs, 06Release-Engineering-Team, 07LDAP: Remove user gerrit2 from ldap - https://phabricator.wikimedia.org/T160122#3118954 (10Paladox) >>! In T160122#3118905, @demon wrote: > Haven't had time. Oh, sorry, i thought it was done. [19:25:50] (03PS3) 10Dzahn: Phabricator: use elasticsearch 5 in codfw and eqiad [puppet] - 10https://gerrit.wikimedia.org/r/343936 (owner: 1020after4) [19:26:21] (03CR) 1020after4: [C: 031] Phabricator: use elasticsearch 5 in codfw and eqiad [puppet] - 10https://gerrit.wikimedia.org/r/343936 (owner: 1020after4) [19:27:13] mutante i found that gwt takes 5-10 seconds to load on gerrit-new.wmflabs.org (may be 1 second less) But polygerrit is 0-1 second. [19:27:26] thats a big improvement :). Expecially when doing reviews. [19:28:52] Writing a comment is instant. Cherry picking is instant. [19:29:01] (03CR) 1020after4: [C: 031] "http://puppet-compiler.wmflabs.org/5855/" [puppet] - 10https://gerrit.wikimedia.org/r/343936 (owner: 1020after4) [19:29:12] cr +2 is instant. [19:29:17] (03CR) 10Paladox: [C: 031] Phabricator: use elasticsearch 5 in codfw and eqiad [puppet] - 10https://gerrit.wikimedia.org/r/343936 (owner: 1020after4) [19:32:19] (03CR) 10Dzahn: [C: 032] "so pretty now how each DC uses the other and it's all in Hiera and just .yaml edits. cool" [puppet] - 10https://gerrit.wikimedia.org/r/343936 (owner: 1020after4) [19:33:49] !log iridium - ran puppet after gerrit:343936 - phabricator config change to use cluster search applied [19:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:11] !log phab2001 - same as iridium, phab search config change [19:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:54] twentyafterfour: done:) very nice how it's all per hiearadata/role/common// [19:37:14] eh, role/ [19:37:30] paladox: nice. 
and i like the search change ^ [19:37:47] yep :) [19:39:15] (03CR) 10Dzahn: "has this been made obsolete by https://gerrit.wikimedia.org/r/#/c/343936/ ? or still needed for labs? in any case i think it should now" [puppet] - 10https://gerrit.wikimedia.org/r/342276 (owner: 10Paladox) [19:39:23] bblack: At some point when you have time, would you be able to deploy https://gerrit.wikimedia.org/r/#/c/341207/ ? [19:39:26] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3119071 (10Marostegui) I am not sure what has happened, but something weird and maybe we have lost its data. The server got rebooted itself while I was on it and started to run PXE boot and started the... [19:40:15] (03PS10) 10Paladox: Phabricator: Use hiera for deciding when to enable read and write for mysql search [puppet] - 10https://gerrit.wikimedia.org/r/342276 [19:40:23] (03PS11) 10Paladox: Phabricator: Use hiera for deciding when to enable read and write for mysql search [puppet] - 10https://gerrit.wikimedia.org/r/342276 [19:41:02] paladox: rebased to .. nothing? [19:41:18] Hmm, looks like it is all done by phabricator_cluster_search now [19:41:31] paladox: see https://gerrit.wikimedia.org/r/#/c/343936/3/hieradata/role/codfw/phabricator/main.yaml [19:41:37] twentyafterfour how can i do this on labs hira for phabricator please? [19:41:43] paladox: 37,38 that is read/write for mysql search [19:41:47] bawolff: yes, now? 
[19:41:59] oh wait i see [19:42:00] paladox: can you set it in Hiera in labs (repo or wiki page) [19:42:01] bblack: Sure [19:42:12] There's no rush in particular, any time is fine [19:42:22] paladox: so i was thinking he probably did what you wanted to do already [19:42:25] I am setting it on horozon ui as thats where the other configs for phabricator are set [19:42:27] (03CR) 10BBlack: [C: 032] Extend the upload Content-Security-Policy test to other large wikis [puppet] - 10https://gerrit.wikimedia.org/r/341207 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [19:42:33] (03PS2) 10BBlack: Extend the upload Content-Security-Policy test to other large wikis [puppet] - 10https://gerrit.wikimedia.org/r/341207 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [19:42:35] paladox: and you can just use it now and don't need your patch anymore [19:42:39] yep [19:42:40] (03CR) 10BBlack: [V: 032 C: 032] Extend the upload Content-Security-Policy test to other large wikis [puppet] - 10https://gerrit.wikimedia.org/r/341207 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [19:42:52] paladox: sounds good, ok [19:43:41] (03Abandoned) 10Dzahn: Phabricator: Use hiera for deciding when to enable read and write for mysql search [puppet] - 10https://gerrit.wikimedia.org/r/342276 (owner: 10Paladox) [19:43:44] mutante it seems that operations/puppet repo keeps getting the user changed in .git/ [19:44:00] Should this be fixed some where in the repo. I mean the actual repo. [19:44:29] As i have to keep doing sudo chown -R gitpuppet:gitpuppet ../puppet [19:44:38] paladox: where? on your laptop? [19:44:46] No on puppet-phabricator [19:45:12] Oh wait [19:45:13] i had one commit there [19:45:29] i dont really know, but it sounds like you were maybe working as root in that repo? 
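The repeated `sudo chown -R gitpuppet:gitpuppet ../puppet` mentioned above can be preceded by a quick check of which paths actually changed owner. A minimal sketch, assuming a Unix host; the helper name `paths_not_owned_by` is hypothetical and not part of any tool named in the log:

```python
import os


def paths_not_owned_by(root, uid):
    """Yield every path under `root` whose owner is not the numeric `uid`.

    Hypothetical helper for illustration only.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat so that symlinks are checked themselves, not their targets
            if os.lstat(path).st_uid != uid:
                yield path


# Example use (assumes a local "gitpuppet" account exists):
#   import pwd
#   uid = pwd.getpwnam("gitpuppet").pw_uid
#   for p in paths_not_owned_by(".git", uid):
#       print(p)
```

If this prints paths shortly after a fresh `chown -R`, something other than the expected user is still writing to the repository.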
[19:45:38] Though that dosen't explain why it fails with [19:45:38] error: insufficient permission for adding an object to repository database .git/objects [19:45:38] fatal: failed to write object [19:45:38] fatal: unpack-objects failed [19:46:18] the answer to the question if it should be fixed inside the actual repo is No [19:46:18] (03CR) 10Yuvipanda: "One super minor nit, lgtm otherwise code-wise. Have you tested this? Not sure how to safely do so tho. Maybe make a copy of the meta db an" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/343894 (https://phabricator.wikimedia.org/T158420) (owner: 10Madhuvishy) [19:46:25] oh [19:47:50] paladox: similar stuff happens when there is a git repo that puppet git pulls to but also a human user (as root) [19:48:10] oh [19:48:18] i didnt run the git pull as root though [19:48:22] always as gitpuppet [19:48:29] paladox: just fix it with chown -R like you did [19:48:37] and see when it happens again [19:48:45] yep it works doing that [19:48:51] PROBLEM - parsoid on wtp2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:49:41] RECOVERY - parsoid on wtp2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.178 second response time [19:50:47] (03PS3) 10Madhuvishy: tools: Update maintain-dbusers to create labsdb accounts for tools users [puppet] - 10https://gerrit.wikimedia.org/r/343894 (https://phabricator.wikimedia.org/T158420) [19:51:31] PROBLEM - parsoid on wtp2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:52:22] RECOVERY - parsoid on wtp2020 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.150 second response time [19:52:35] bblack: thank you :) [19:53:41] PROBLEM - parsoid on wtp2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:54:31] RECOVERY - parsoid on wtp2011 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.099 second response time [19:54:59] !log thcipriani@tin Finished scap: testwiki to php-1.29.0-wmf.17 and rebuild l10n cache (duration: 51m 16s) [19:55:04] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:54] (03PS5) 10Paladox: Phabricator: Allow us to install php7.1 for testing on labs. [puppet] - 10https://gerrit.wikimedia.org/r/343211 [20:00:46] (03PS6) 10Paladox: Phabricator: Allow us to install php7.1 for testing on labs. [puppet] - 10https://gerrit.wikimedia.org/r/343211 [20:00:59] (03PS7) 10Paladox: Phabricator: Allow us to install php7.1 for testing on labs. [puppet] - 10https://gerrit.wikimedia.org/r/343211 [20:01:46] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3119152 (10Marostegui) I forced it to boot from disk, but it is not booting. The RAID looks healthy from the raid controller (and bios) raid menu, the virtual disk is there. But the server isn't booting... [20:01:50] (03PS4) 10Madhuvishy: tools: Update maintain-dbusers to create labsdb accounts for tools users [puppet] - 10https://gerrit.wikimedia.org/r/343894 (https://phabricator.wikimedia.org/T158420) [20:02:06] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: Allow us to install php7.1 for testing on labs. 
[puppet] - 10https://gerrit.wikimedia.org/r/343211 (owner: 10Paladox) [20:04:10] (03PS1) 10Jcrespo: mariadb: Pool db1094 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343955 (https://phabricator.wikimedia.org/T160832) [20:08:11] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [20:09:51] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.245 second response time [20:10:11] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1278 [20:12:12] madhuvishy: is the maintain-dbusers related to work you're doing? ^ [20:12:26] no [20:12:32] i haven't touched it [20:12:36] ok [20:12:38] looking anyway [20:12:52] madhuvishy: ok! thank you :) [20:15:11] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1577 [20:15:39] (03PS1) 10Thcipriani: Group0 to 1.29.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343959 [20:16:42] (03CR) 10Thcipriani: [C: 032] Group0 to 1.29.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343959 (owner: 10Thcipriani) [20:19:23] ACKNOWLEDGEMENT - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1577 Jeff_Green query was killed on master, stuck query, investigating [20:20:11] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 1292003 Threads: 2 Questions: 29678189 Slow queries: 7146 Opens: 8217 Flush tables: 1 Open tables: 592 Queries per second avg: 22.970 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [20:22:20] thcipriani: lemme know when things are calm for a moment, i need to flip a config flag in cirrus. 
A bug is causing ~100 errors per minute [20:22:31] (03PS1) 10EBernhardson: Turn completion suggester off until length error is fixed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343961 (https://phabricator.wikimedia.org/T161001) [20:22:36] ebernhardson: will do [20:22:46] just have to sync wikiversions.... [20:24:51] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.825 second response time [20:27:54] (03Merged) 10jenkins-bot: Group0 to 1.29.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343959 (owner: 10Thcipriani) [20:28:03] (03CR) 10jenkins-bot: Group0 to 1.29.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343959 (owner: 10Thcipriani) [20:29:11] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: Group0 to 1.29.0-wmf.17 [20:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:28] ^ ebernhardson you should be all clear to do the needful on tin [20:29:34] thcipriani: thanks [20:29:44] (03PS2) 10EBernhardson: Turn completion suggester off until length error is fixed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343961 (https://phabricator.wikimedia.org/T161001) [20:29:48] (03PS1) 10Chad: Enable headers-sent logging bucket [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343964 [20:29:50] (03CR) 10EBernhardson: [C: 032] Turn completion suggester off until length error is fixed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343961 (https://phabricator.wikimedia.org/T161001) (owner: 10EBernhardson) [20:31:42] (03Merged) 10jenkins-bot: Turn completion suggester off until length error is fixed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343961 (https://phabricator.wikimedia.org/T161001) (owner: 10EBernhardson) [20:31:51] (03CR) 10jenkins-bot: Turn completion suggester off until length error is fixed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343961 
(https://phabricator.wikimedia.org/T161001) (owner: 10EBernhardson) [20:33:32] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: T161001 Turn off completion suggester until length error is fixed (duration: 00m 44s) [20:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:38] T161001: Completion Suggester errors after elastic5 rollout - https://phabricator.wikimedia.org/T161001 [20:35:40] thcipriani: all done, thaks! [20:35:54] ebernhardson: awesome, thanks :) [20:38:11] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [20:38:22] (03PS2) 10Chad: Enable headers-sent logging bucket [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343964 [20:39:01] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [20:43:17] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3119362 (10aaron) >>! In T156924#3117017, @Joe wrote: >>>! In T156924#3116897, @aaron wrote: >>... 
[20:51:48] (03CR) 10Chad: [C: 032] Enable headers-sent logging bucket [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343964 (owner: 10Chad) [20:54:31] (03Merged) 10jenkins-bot: Enable headers-sent logging bucket [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343964 (owner: 10Chad) [20:54:44] (03CR) 10jenkins-bot: Enable headers-sent logging bucket [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343964 (owner: 10Chad) [20:56:37] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: logging for bad header stuff (duration: 00m 52s) [20:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:09] (03PS1) 10Thcipriani: Revert "Group0 to 1.29.0-wmf.17" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343970 [20:57:22] (03CR) 10Thcipriani: [C: 032] Revert "Group0 to 1.29.0-wmf.17" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343970 (owner: 10Thcipriani) [20:58:01] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:58:38] (03Merged) 10jenkins-bot: Revert "Group0 to 1.29.0-wmf.17" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343970 (owner: 10Thcipriani) [20:58:46] (03CR) 10jenkins-bot: Revert "Group0 to 1.29.0-wmf.17" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343970 (owner: 10Thcipriani) [20:59:21] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: Revert Group0 to 1.29.0-wmf.17 [20:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:41] PROBLEM - parsoid on wtp2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:02:31] RECOVERY - parsoid on wtp2008 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.118 second response time [21:12:31] PROBLEM - parsoid on wtp2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:13:21] RECOVERY - parsoid on wtp2009 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.108 second response time [21:21:21] (03CR) 10Chad: "Please stop re-adding me to this change, I do not care." [puppet] - 10https://gerrit.wikimedia.org/r/343211 (owner: 10Paladox) [21:21:45] (03CR) 10Paladox: "> Please stop re-adding me to this change, I do not care." [puppet] - 10https://gerrit.wikimedia.org/r/343211 (owner: 10Paladox) [21:25:01] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [21:26:21] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:41:10] (03PS3) 10Milimetric: Fix labs-specific Dashiki hack with generic enable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336446 (https://phabricator.wikimedia.org/T161038) [21:43:43] (03PS15) 10Paladox: Fix some Debian lintian warnnings for the gerrit package [debs/gerrit] - 10https://gerrit.wikimedia.org/r/343297 [21:44:31] Some how wikimedia is hitting yahoo's internal filter. 
[21:44:51] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.245 second response time [21:45:03] Me receving emails from wikimedia have become slow. Yahoo tells me it's probaly the internal filter they have. [21:49:51] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.370 second response time [21:54:21] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [22:00:18] 06Operations, 10ORES, 10Revision-Scoring-As-A-Service-Backlog: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615#3119850 (10mobrovac) What about setting ORES up behind RESTBase and use Cassandra? In that case it wouldn't even matter where the re... [22:06:37] ebernhardson: It seems mwgrep is broken [22:06:42] "Unknown error: no [query] registered for [filtered]" [22:06:48] Anything recent that changed? [22:06:57] Krinkle: upgraded elasticsearch versions from 2.x to 5.x [22:07:01] they love breaking api's ... [22:07:20] It worked less than a day ago [22:07:32] Krinkle: the eqiad cluster was upgraded this morning [22:07:52] ebernhardson: Do you think the fix would be simple? Or should I file a task? [22:08:01] https://github.com/wikimedia/puppet/blob/production/modules/scap/files/mwgrep [22:09:25] Krinkle: probably needs a general task, you can s/filtered/bool/ to solve that issue, but next step will be that elasticseach 5.x doesn't like querying > 1000 shards in a single request, and you're querying 3k [22:09:39] its just a setting ... but probably needs some consideration [22:09:55] ebernhardson: es upgrade task? 
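The `s/filtered/bool/` suggestion above reflects that Elasticsearch 5.x removed the `filtered` query in favour of `bool` with `must` and `filter` clauses. A minimal sketch of that rewrite on a request body, assuming the classic 2.x `filtered` shape; the helper name `filtered_to_bool` and the sample body are hypothetical and are not taken from mwgrep:

```python
import copy


def filtered_to_bool(query):
    """Return a copy of `query` with a top-level `filtered` clause
    rewritten as the equivalent `bool` clause.

    Hypothetical helper for illustration; not part of mwgrep.
    """
    if "filtered" not in query:
        return copy.deepcopy(query)
    old = query["filtered"]
    new = {}
    if "query" in old:
        new["must"] = copy.deepcopy(old["query"])      # scoring part
    if "filter" in old:
        new["filter"] = copy.deepcopy(old["filter"])   # non-scoring part
    rest = {k: copy.deepcopy(v) for k, v in query.items() if k != "filtered"}
    rest["bool"] = new
    return rest


# Made-up sample body, only to show the shape of the change.
old_query = {
    "filtered": {
        "query": {"match": {"text": "importScript"}},
        "filter": {"term": {"namespace": 8}},
    }
}
new_query = filtered_to_bool(old_query)
print(new_query)
```

This only covers the query syntax; the limit on shards per request mentioned in the same message is a separate cluster-side concern.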
[22:10:14] Krinkle: T151324 [22:10:15] T151324: [epic] System level upgrade for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151324 [22:11:11] Krinkle: for the moment you could `sed 's/filtered/bool/' $(which mwgrep) > ~/bin/mwgrep` and run it, i put in some transient settings that will let it avoid the 1k shards per req while we figure out if there is a better answer [22:12:05] (03CR) 10Dzahn: [C: 04-1] "better than before for sure, but i don't think it should be "if > jessie". it should be "if not in production" in one way or another. we n" [puppet] - 10https://gerrit.wikimedia.org/r/343211 (owner: 10Paladox) [22:13:32] (03PS22) 10Dzahn: gerrit: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/342692 [22:22:03] jouncebot: next [22:22:03] In 0 hour(s) and 37 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170321T2300) [22:28:17] !log Create Translate tables on betawikiversity (T160120) [22:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:23] T160120: Install Extension:Translate on beta.wikiversity - https://phabricator.wikimedia.org/T160120 [22:31:31] (03PS1) 10Dereckson: Enable Mapframe on sv.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344043 (https://phabricator.wikimedia.org/T161032) [22:32:55] (03CR) 10Paladox: gerrit: convert to profile/role structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342692 (owner: 10Dzahn) [22:33:17] (03PS1) 10EBernhardson: Update mwgrep for elasticsearch 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344044 (https://phabricator.wikimedia.org/T161055) [22:33:48] @seen codezee [22:33:48] yannf: Last time I saw codezee they were quitting the network with reason: Quit: Leaving N/A at 1/10/2017 8:39:26 PM (70d1h54m21s ago) [22:34:04] "The arbitration committee is neither the Ten Commandments on the stone nor the divine inviolability from heaven" heh [22:34:27] 
Google translated Korean talk pages [22:34:34] andre__, are you here? [22:34:58] yannf: yes, but I think we're not going to discuss Operations issues? :P [22:36:09] (03PS1) 10Dereckson: Enable Translate on beta.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344045 (https://phabricator.wikimedia.org/T160120) [22:37:57] (03CR) 10Krinkle: [C: 031] "Change to mwgrep verified on terbium and works as expected." [puppet] - 10https://gerrit.wikimedia.org/r/344044 (https://phabricator.wikimedia.org/T161055) (owner: 10EBernhardson) [22:44:53] !log Run namespaceDupes on pnbwiki (T159976) [22:44:57] !log lists: deactivate arbcom-ko per T160892 and Google translation of Korean talk pages [22:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:00] T159976: Run namespaceDupes.php for wikis in Western Punjabi (pnb) - https://phabricator.wikimedia.org/T159976 [22:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:10] T160892: Close arbcom-ko mailing list - https://phabricator.wikimedia.org/T160892 [22:49:37] (03CR) 10Paladox: [C: 031] "Tested in labs and works" [puppet] - 10https://gerrit.wikimedia.org/r/342692 (owner: 10Dzahn) [22:50:06] (03CR) 10Dzahn: [C: 032] "thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/342692 (owner: 10Dzahn) [22:50:22] your welcome :) [22:50:30] (03CR) 10Chad: gerrit: convert to profile/role structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342692 (owner: 10Dzahn) [22:51:12] (03CR) 10Paladox: gerrit: convert to profile/role structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342692 (owner: 10Dzahn) [22:51:18] ok, so the Hiera part in labs has been adjusted [22:51:22] because i moved some stuff around [22:51:33] but now it will not affect the test intsances either [22:52:13] Notice: Finished catalog run in 10.59 seconds [22:52:21] nothing as expected :) yay [22:52:21] PROBLEM - puppet last run on labvirt1007 is CRITICAL: 
CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:52:56] (03CR) 10Dzahn: "confirmed no-op on cobalt :)" [puppet] - 10https://gerrit.wikimedia.org/r/342692 (owner: 10Dzahn) [22:57:38] 06Operations, 10Traffic: Define 3-host infra cluster for traffic pops - https://phabricator.wikimedia.org/T96852#3120141 (10BBlack) [22:58:09] "Moving to APFS is like rebuilding the HFS filesystem.It doesn't touch any files." [22:58:12] 06Operations: ulsfo: add a DNS recursor - https://phabricator.wikimedia.org/T82996#3120148 (10BBlack) [22:58:16] 06Operations, 10Traffic: Define 3-host infra cluster for traffic pops - https://phabricator.wikimedia.org/T96852#1227571 (10BBlack) [22:58:16] (or it touches ALL the files ) [22:58:30] " [22:58:30] Correct, the transition will be seamless. [22:58:34] (03PS1) 10Chad: Gerrit: Make $host required, rather than optional then failing [puppet] - 10https://gerrit.wikimedia.org/r/344050 [22:58:43] "I have learned to be terrified whenever a vendor said that to me about an update." [22:58:46] mutante my iphone was converted already [22:58:55] it was a very smooth transition actually [22:59:00] jouncebot: refresh [22:59:03] I refreshed my knowledge about deployments. [22:59:16] thanks mutante [22:59:52] ebernhardson: I CR +2 your changes [22:59:57] (03CR) 10Paladox: [C: 031] Gerrit: Make $host required, rather than optional then failing [puppet] - 10https://gerrit.wikimedia.org/r/344050 (owner: 10Chad) [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170321T2300). [23:00:04] ebernhardson, RoanKattouw, and Dereckson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:10] revi: no problem [23:00:12] Okay, I can SWAT this evening. 
[23:00:16] yay [23:00:32] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS1257/IPv6: Active, AS1257/IPv4: Active [23:01:15] (03CR) 10Dzahn: [C: 032] Gerrit: Make $host required, rather than optional then failing [puppet] - 10https://gerrit.wikimedia.org/r/344050 (owner: 10Chad) [23:01:24] yes, that was a nice one, RainbowSprinkles [23:01:37] $host was not even a parameter before [23:01:51] $host has been around for awhile [23:01:54] It was legacy from the refactoring we did last year [23:02:00] the fail() check [23:02:09] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/343437/ is Not until March 28th [23:02:10] I'd been meaning to clean that up [23:02:17] ah, yes:) [23:02:18] \o [23:02:22] mutante: any idea on https://phabricator.wikimedia.org/T160883 ? :-p [23:02:32] RoanKattouw: perhaps should you remove your CR -2 if you changed your mind [23:02:48] Oh sorry yes [23:02:53] Ahm wait what [23:02:54] (03PS2) 10Volans: Better handling of item return codes [switchdc] - 10https://gerrit.wikimedia.org/r/343633 (https://phabricator.wikimedia.org/T160178) [23:02:55] That is not the one I meant [23:02:57] revi: no [23:03:06] mutante: I also think the ferm calls should go into the module and out of the profile. The http ones especially shouldn't be turned on unless we're master [23:03:10] uhm, kk [23:03:11] Dereckson: https://gerrit.wikimedia.org/r/#/c/343436 is for today [23:03:22] ok [23:03:33] Whoops sorry got the numbers wrong on the wiki apparently [23:03:48] I needed 36 and 35, not 36 and 37 [23:03:51] 343435 + 343436? [23:03:51] ok [23:04:05] (03PS2) 10Dereckson: Add rcenhancedfilters to BetaFeatures whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343435 (owner: 10Catrope) [23:04:07] Phew, glad that -2 was there [23:05:05] RainbowSprinkles: hmm. having ferm rules in profile was specifically following the example from the puppet coding page though [23:05:20] Fair enough. 
Other option is moving the master/slave detection to the profile [23:05:25] And passing slave status to the module [23:05:32] (which is all the module needs to know tbh) [23:05:54] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343435 (owner: 10Catrope) [23:09:07] (03Merged) 10jenkins-bot: Add rcenhancedfilters to BetaFeatures whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343435 (owner: 10Catrope) [23:09:17] (03CR) 10jenkins-bot: Add rcenhancedfilters to BetaFeatures whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343435 (owner: 10Catrope) [23:09:37] (03PS3) 10Dereckson: Enable RCFilters beta feature on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343436 (owner: 10Catrope) [23:09:42] (03CR) 10Dereckson: [C: 032] Enable RCFilters beta feature on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343436 (owner: 10Catrope) [23:09:51] (03PS1) 10Andrew Bogott: Nova scheduler: Prefer virthosts with lower CPU usage [puppet] - 10https://gerrit.wikimedia.org/r/344051 (https://phabricator.wikimedia.org/T161006) [23:10:37] RoanKattouw: 35 on mwdebug1002.eqiad.wmnet, but I imagine you want to wait the other one to test [23:12:22] (03PS1) 10Chad: Gerrit: Move master/slave detection to profile [puppet] - 10https://gerrit.wikimedia.org/r/344053 [23:12:25] mutante: Muchhhhhh nicer ^ [23:12:59] (03Merged) 10jenkins-bot: Enable RCFilters beta feature on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343436 (owner: 10Catrope) [23:13:18] (03CR) 10jenkins-bot: Enable RCFilters beta feature on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343436 (owner: 10Catrope) [23:13:46] RoanKattouw: 36 on mwdebug1002 too [23:13:50] mutante i also forgot. You can vote on changes even if they are merged now. [23:14:09] that can be done in polygerrit or gwt. [23:14:56] Did they make it possible to remove yourself after they're merged again? 
[23:14:58] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Move master/slave detection to profile [puppet] - 10https://gerrit.wikimedia.org/r/344053 (owner: 10Chad) [23:14:58] I miss that :( [23:15:05] Let me check [23:15:11] Dereckson: Thanks, checking [23:16:07] revi: maybe try "X-Spam-Status: Yes" [23:16:16] RainbowSprinkles looks like no but they allow you to add your self as a reviewer after it's merged though [23:16:28] ebernhardson: https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit-jessie/26395/ [23:16:31] Bleh, that's the opposite of what I want :( [23:16:50] revi: actually, nevermind, i don't see why it would fail though, sorry :/ [23:16:53] I hate when I'm cc'd on a patch, it merges, then there's a lengthy followup discussion and I can't remove myself [23:16:57] heh, k [23:17:10] yep [23:17:13] same here tho [23:17:25] revi: all i see is that the format changed some time in the past, it used to check for * (spam stars) and then for + [23:17:32] Dereckson: ? [23:17:54] ebernhardson: if mediawiki-extensions-qunit-jessie is voting, it won't be merged, as "karma" qunit test failed [23:18:09] Dereckson: 35+36 look good to me [23:18:15] Dereckson: the patches don't touch anything related to qunit, just recheck it [23:18:21] ebernhardson: ok [23:18:23] revi: maybe the regex thinks it must be X-Spam-Score immediately followed by 3 +'s, but in reality it is X-Spam-Score: and then +'s [23:18:31] (03PS2) 10Chad: Gerrit: Move master/slave detection to profile [puppet] - 10https://gerrit.wikimedia.org/r/344053 [23:18:35] ebernhardson: 66 passes qunit note [23:18:43] issue is only for 62 [23:18:46] RainbowSprinkles better idea quit notifications for the change. 
[23:18:48] hmm [23:19:36] paladox: That would work too [23:19:38] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Make rcenhancedfilters available as beta feature, enable on test wikis ([[Gerrit:343435]] + [[Gerrit:343436]]) (duration: 00m 51s) [23:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:47] Yep [23:19:57] :q [23:20:07] (03PS2) 10Dereckson: Enable Translate on beta.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344045 (https://phabricator.wikimedia.org/T160120) [23:20:18] (03CR) 10Dereckson: [C: 032] Enable Translate on beta.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344045 (https://phabricator.wikimedia.org/T160120) (owner: 10Dereckson) [23:21:21] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [23:21:29] RainbowSprinkles i filled the bug https://bugs.chromium.org/p/gerrit/issues/detail?id=5834 [23:21:37] though not really a bug but feature request :) [23:23:30] ebernhardson: Improve speed of scrolling results in comp suggest build on mwdebug1002, you want I pull it too on Terbium to test? 
[23:23:34] (03Merged) 10jenkins-bot: Enable Translate on beta.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344045 (https://phabricator.wikimedia.org/T160120) (owner: 10Dereckson) [23:23:42] (03CR) 10jenkins-bot: Enable Translate on beta.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344045 (https://phabricator.wikimedia.org/T160120) (owner: 10Dereckson) [23:24:08] Dereckson: it needs both patches to be useful [23:24:12] Translate on beta.wikiversity on mwdebug1002 too [23:24:17] Dereckson: its a maint script, so wont hurt anything [23:24:37] ebernhardson: ok, see you in 20 minutes then (Zuul, round 2) [23:25:03] (03CR) 10Volans: [C: 032] Better handling of item return codes [switchdc] - 10https://gerrit.wikimedia.org/r/343633 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [23:28:01] translate looks good on beta.wikiv [23:29:11] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable Translate on beta.wikiversity (T160120) (duration: 00m 45s) [23:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:18] T160120: Install Extension:Translate on beta.wikiversity - https://phabricator.wikimedia.org/T160120 [23:31:31] PROBLEM - puppet last run on mw1288 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:31:44] (03PS2) 10Dereckson: Enable Mapframe on sv.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344043 (https://phabricator.wikimedia.org/T161032) [23:31:53] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344043 (https://phabricator.wikimedia.org/T161032) (owner: 10Dereckson) [23:39:39] (03Merged) 10jenkins-bot: Enable Mapframe on sv.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344043 (https://phabricator.wikimedia.org/T161032) (owner: 10Dereckson) [23:39:48] (03CR) 10jenkins-bot: Enable Mapframe on sv.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344043 (https://phabricator.wikimedia.org/T161032) (owner: 10Dereckson) [23:40:04] mapframe on mwdebug1002 [23:41:23] RainbowSprinkles mutante /me updates the release notes for 2.14 https://gerrit-review.googlesource.com/#/c/100773/3/releases/2.14.md :) [23:42:09] i'm getting to the next gerrit change now [23:43:33] (03CR) 10Paladox: Gerrit: Move master/slave detection to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344053 (owner: 10Chad) [23:43:51] RainbowSprinkles: it says we are not supposed to use defaults with the Hiera calls, and i am not sure about the answer to that [23:43:57] yet [23:43:59] mapframe works [23:44:47] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable Mapframe on sv.wikipedia (T161032) (duration: 00m 43s) [23:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:53] T161032: Please turn on mapframe for Swedish Wikipedia - https://phabricator.wikimedia.org/T161032 [23:45:51] PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [23:48:03] (03CR) 10Dzahn: Gerrit: Move master/slave detection to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344053 (owner: 10Chad) [23:48:24] ebernhardson: CirrusSearch totally 
enqueued, no test done, count 20 minutes more [23:52:39] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labnet1003/1004 - https://phabricator.wikimedia.org/T158204#3029672 (10faidon) I'm not sure yet exactly which configurations are affected by the SSD/memory shortage, but I'm wondering if it would affect this order by virtue of... [23:52:43] (03PS1) 10Dereckson: Allow translationadmin self-add for beta.wikiversity admins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344060 (https://phabricator.wikimedia.org/T160120) [23:53:30] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344060 (https://phabricator.wikimedia.org/T160120) (owner: 10Dereckson) [23:53:31] Dereckson: i gotta grab a train, RoanKattouw is going to test my patch, i showed him how [23:53:35] (03CR) 10Chad: Gerrit: Move master/slave detection to profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/344053 (owner: 10Chad) [23:53:48] it only affects maint scripts [23:54:46] (03PS2) 10Volans: Check that core DBs replica is in sync [switchdc] - 10https://gerrit.wikimedia.org/r/343627 (https://phabricator.wikimedia.org/T160178) [23:54:49] (03PS3) 10Volans: Add MediaWiki config tasks for ro/rw mode [switchdc] - 10https://gerrit.wikimedia.org/r/343858 (https://phabricator.wikimedia.org/T160178) [23:55:17] ebernhardson: ok, have a nice evening [23:59:31] RECOVERY - puppet last run on mw1288 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures