[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160818T0000). [00:01:17] (03PS1) 10Dzahn: site.pp: fix/add some node comments [puppet] - 10https://gerrit.wikimedia.org/r/305427 [00:05:29] (03PS2) 10Dzahn: site.pp: fix/add some node comments [puppet] - 10https://gerrit.wikimedia.org/r/305427 [00:06:00] (03CR) 10Dzahn: [C: 032] site.pp: fix/add some node comments [puppet] - 10https://gerrit.wikimedia.org/r/305427 (owner: 10Dzahn) [00:15:44] (03CR) 10Dzahn: [C: 031] "the only diff on carbon is the role name http://puppet-compiler.wmflabs.org/3749/carbon.wikimedia.org/ but it makes us flexible about usi" [puppet] - 10https://gerrit.wikimedia.org/r/305163 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [00:25:05] (03PS6) 10Dzahn: installserver: split DHCP part out into own role [puppet] - 10https://gerrit.wikimedia.org/r/305163 (https://phabricator.wikimedia.org/T132757) [00:27:24] !log maxsem@tin Finished scap: https://gerrit.wikimedia.org/r/#/c/305424/ (duration: 52m 22s) [00:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:28:15] (03PS4) 10Yuvipanda: dynamicproxy: puppetize appendfilename setting [puppet] - 10https://gerrit.wikimedia.org/r/304994 (owner: 10Giuseppe Lavagetto) [00:28:31] (03CR) 10Yuvipanda: [C: 032 V: 032] "I manually cherry picked and tested this on the tools puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/304994 (owner: 10Giuseppe Lavagetto) [00:32:27] (03PS1) 10Dzahn: DHCP: make configurable in Hiera which is the running server [puppet] - 10https://gerrit.wikimedia.org/r/305429 (https://phabricator.wikimedia.org/T132757) [00:34:05] (03PS7) 10Dzahn: installserver: split DHCP part out into own role [puppet] - 10https://gerrit.wikimedia.org/r/305163 (https://phabricator.wikimedia.org/T132757) [00:34:07] (03PS2) 10Dzahn: DHCP: make configurable in Hiera which is the running server [puppet] - 10https://gerrit.wikimedia.org/r/305429 (https://phabricator.wikimedia.org/T132757) [00:37:38] (03PS1) 10Dzahn: put installserver::dhcp on install1001, install2001 [puppet] - 10https://gerrit.wikimedia.org/r/305431 (https://phabricator.wikimedia.org/T132757) [01:22:06] (03CR) 10Yuvipanda: [C: 031] "Minor comments, but seems good otherwise!" 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/301505 (https://phabricator.wikimedia.org/T141014) (owner: 10BryanDavis) [01:50:12] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 1808.655521 Seconds [01:52:03] (03Draft2) 10MarcoAurelio: Fully restrict uploads on ms.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305436 (https://phabricator.wikimedia.org/T126944) [01:52:12] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 72.312707 Seconds [01:52:19] (03CR) 10jenkins-bot: [V: 04-1] Fully restrict uploads on ms.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305436 (https://phabricator.wikimedia.org/T126944) (owner: 10MarcoAurelio) [01:55:25] (03PS3) 10MarcoAurelio: Fully restrict uploads on ms.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305436 (https://phabricator.wikimedia.org/T126944) [02:04:10] (03PS4) 10MarcoAurelio: Fully restrict uploads on ms.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305436 (https://phabricator.wikimedia.org/T126944) [02:04:36] (03PS5) 10Liuxinyu970226: Fully restrict uploads on ms.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305436 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [02:10:28] (03PS6) 10MarcoAurelio: Fully restrict uploads on ms.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305436 (https://phabricator.wikimedia.org/T141227) [02:11:09] (03CR) 10MarcoAurelio: "Sorry for too many minor fixes. It's late here..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305436 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [02:29:13] PROBLEM - Varnishkafka log producer on cp3010 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [02:34:09] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.14) (duration: 11m 58s) [02:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:44:53] RECOVERY - Varnishkafka log producer on cp3010 is OK: PROCS OK: 1 process with command name varnishkafka [02:59:39] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.15) (duration: 08m 44s) [02:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:06:44] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Aug 18 03:06:44 UTC 2016 (duration 7m 5s) [03:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:34:13] 06Operations, 10Security-Reviews, 06Services, 06Services-next, 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2563267 (10GWicke) >>! In T142226#2540859, @GWicke wrote: > To get information on the relative frequency of... [03:43:09] 06Operations, 10Security-Reviews, 06Services, 06Services-next, 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2563269 (10ssastry) >>! In T142226#2563267, @GWicke wrote: >>>! In T142226#2540859, @GWicke wrote: >> To get... 
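The maps-test2002 replication-lag alerts above come from a Postgres lag check of the kind discussed later in this log (check_postgres_replication_lag.py, which needs psycopg). A minimal sketch of that idea, assuming psycopg2, a hypothetical monitoring user and made-up thresholds rather than the production plugin:

```python
#!/usr/bin/env python3
# Illustrative sketch only -- not the production check_postgres_replication_lag.py.
# Connection parameters and thresholds are assumptions for the example.
import sys
import psycopg2

WARN_SECONDS = 300
CRIT_SECONDS = 1800  # the alert above fired at ~1808s

def replica_delay(conn):
    """Return replication delay in seconds as seen on a streaming replica."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
        )
        return cur.fetchone()[0]  # None if the server is not replaying WAL

def main():
    conn = psycopg2.connect("host=localhost dbname=template1 user=replication_check")
    delay = replica_delay(conn)
    if delay is None:
        print("UNKNOWN - Rep Delay is: None Seconds (not a replica?)")
        return 3
    if delay >= CRIT_SECONDS:
        print("CRITICAL - Rep Delay is: %f Seconds" % delay)
        return 2
    if delay >= WARN_SECONDS:
        print("WARNING - Rep Delay is: %f Seconds" % delay)
        return 1
    print("OK - Rep Delay is: %f Seconds" % delay)
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

On a master, or on a replica that never started replaying WAL, pg_last_xact_replay_timestamp() is NULL, which is one way to end up with the "Rep Delay is: None Second" result debugged around 09:17 below.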
[04:28:02] (03CR) 10BryanDavis: Provision Striker via scap3 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/301505 (https://phabricator.wikimedia.org/T141014) (owner: 10BryanDavis) [05:37:10] 06Operations, 10MediaWiki-extensions-CentralNotice, 10Traffic, 13Patch-For-Review: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2563345 (10Nemo_bis) [05:37:14] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 10Traffic, 13Patch-For-Review: ULS GeoIP should use the Cookie - https://phabricator.wikimedia.org/T143270#2562516 (10Nikerabbit) The cookie is already always used on WMF sites, because `$wgULSGeoService = false;` in CommonSettings.php. All ot... [05:43:28] (03CR) 10Nemo bis: "https://phabricator.wikimedia.org/T66582 is still restricted. Not fixed yet?" [puppet] - 10https://gerrit.wikimedia.org/r/136655 (https://bugzilla.wikimedia.org/64582) (owner: 10Ori.livneh) [05:44:44] 06Operations, 10Traffic: Varnish GeoIP is broken for HTTPS+IPv6 traffic - https://phabricator.wikimedia.org/T89688#2563362 (10Nemo_bis) [05:44:47] 06Operations, 10Traffic, 13Patch-For-Review: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2563361 (10Nemo_bis) [05:46:12] 06Operations, 10Traffic, 07HTTPS: Varnish GeoIP is broken for HTTPS+IPv6 traffic - https://phabricator.wikimedia.org/T89688#1042675 (10Nemo_bis) [06:36:05] (03CR) 10Muehlenhoff: [C: 031] "Looks good. Minor nit: Since you're not using $phabricator_servers elsewhere, you could just as well use" [puppet] - 10https://gerrit.wikimedia.org/r/305277 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [06:40:03] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Puppet has 1 failures [06:40:41] 06Operations, 10Traffic, 13Patch-For-Review: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2563420 (10Nemo_bis) [06:46:36] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2563422 (10Nemo_bis) > I do see lots of legit referer headers. ULS uses it, for instance, and [hundreds of standalone wikis](https://wikiapi... [06:55:18] (03CR) 10Gilles: [C: 031] varnishmedia: remove dead code paths [puppet] - 10https://gerrit.wikimedia.org/r/305287 (owner: 10Ema) [07:05:42] RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [07:15:15] (03CR) 10Muehlenhoff: [C: 031] "Looks fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/301149 (https://phabricator.wikimedia.org/T114161) (owner: 10Alex Monk) [07:18:22] (03PS3) 10Giuseppe Lavagetto: redis::instance: use specific aof/rdb file names by default [puppet] - 10https://gerrit.wikimedia.org/r/301789 (https://phabricator.wikimedia.org/T134400) [07:23:37] (03CR) 10Giuseppe Lavagetto: [C: 032] redis::instance: use specific aof/rdb file names by default [puppet] - 10https://gerrit.wikimedia.org/r/301789 (https://phabricator.wikimedia.org/T134400) (owner: 10Giuseppe Lavagetto) [07:28:52] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [07:29:22] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [07:31:01] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [07:31:32] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 [07:36:26] seaborgium has enabled debug logs for slapd, it may create disk space issues soon [07:36:44] I've started by deleting the deb package cache [07:59:01] having a look, the current loglevel is around forever [08:03:18] ah, that's the bdb transction log [08:04:31] http://www.openldap.org/faq/index.cgi?_highlightWords=bdb&file=738 [08:07:24] I think is a recent thing [08:07:47] there is also a large file on var lib [08:08:08] it could have been forever but only creating a lot of entries recently, though [08:10:18] !log stoping mysql on db1042, db2009 for testing (both are depooled and alerts disabled) [08:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:11:41] there's the slapd-audit.log, which is currently 541 megs and then DBD log.0000xxxx log entries date back until Dec 8, they are 10 megs each and 1221 of them on serpens, so that adds up [08:18:21] PROBLEM - puppet last run on nihal is CRITICAL: CRITICAL: Puppet has 1 failures [08:18:45] (03CR) 10Alexandros Kosiaris: [C: 031] postgresql::server: fix service name on jessie. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/304456 (owner: 10Giuseppe Lavagetto) [08:26:21] (03CR) 10Faidon Liambotis: [C: 04-1] "Is this actually needed? These are session cookies, so this will only affect people that have kept their browser open. 
We can wait a coupl" [puppet] - 10https://gerrit.wikimedia.org/r/305419 (https://phabricator.wikimedia.org/T99226) (owner: 10BBlack) [08:36:46] (03PS1) 10Alexandros Kosiaris: puppetmaster: Only create puppetdb user on the master [puppet] - 10https://gerrit.wikimedia.org/r/305463 [08:40:06] (03CR) 10Alexandros Kosiaris: [C: 032] puppetmaster: Only create puppetdb user on the master [puppet] - 10https://gerrit.wikimedia.org/r/305463 (owner: 10Alexandros Kosiaris) [08:42:22] RECOVERY - puppet last run on nihal is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [08:52:57] (03PS1) 10Gehel: Postgresql - check_postgres_replication_lag.py requires psycopg [puppet] - 10https://gerrit.wikimedia.org/r/305466 [08:53:09] (03PS2) 10Dzahn: Delete videos.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/302873 [08:53:14] (03PS3) 10Faidon Liambotis: Delete videos.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/302873 (owner: 10Dzahn) [08:53:36] (03CR) 10Faidon Liambotis: [C: 032] Delete videos.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/302873 (owner: 10Dzahn) [08:54:28] (03PS2) 10Muehlenhoff: Support scaling of huge SVGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303548 (https://phabricator.wikimedia.org/T111815) [08:54:34] (03PS2) 10Dzahn: Delete strategyapps.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/302870 (https://phabricator.wikimedia.org/T31675) [08:54:38] (03PS3) 10Faidon Liambotis: Delete strategyapps.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/302870 (https://phabricator.wikimedia.org/T31675) (owner: 10Dzahn) [08:54:43] (03CR) 10Alexandros Kosiaris: [C: 032] Postgresql - check_postgres_replication_lag.py requires psycopg [puppet] - 10https://gerrit.wikimedia.org/r/305466 (owner: 10Gehel) [08:54:52] (03CR) 10Faidon Liambotis: [C: 032] Delete strategyapps.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/302870 (https://phabricator.wikimedia.org/T31675) (owner: 10Dzahn) [08:56:04] (03PS2) 10Faidon Liambotis: openldap: enable the memberof overlay [puppet] - 10https://gerrit.wikimedia.org/r/295357 (https://phabricator.wikimedia.org/T142817) [08:58:48] !log jmm@tin Synchronized wmf-config/CommonSettings.php: enable SVG scaling of huge files (duration: 00m 48s) [08:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:59:34] !log enabled scaling of huge SVGs on image scalers (T111815) [08:59:35] T111815: SVG files larger than 10 MB cannot be thumbnailed - https://phabricator.wikimedia.org/T111815 [08:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:01:12] (03PS1) 10Gehel: Postgresql - duplicate declaration of psycopg package, without appropriate guard. [puppet] - 10https://gerrit.wikimedia.org/r/305467 [09:02:53] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: puppet fail [09:03:10] akosiaris: there was a double declaration of psycopg for labsdb node in my previous change (catched by puppet-compiler). I'm checking the fix right now... [09:03:46] (03CR) 10Gehel: [C: 032] Postgresql - duplicate declaration of psycopg package, without appropriate guard. [puppet] - 10https://gerrit.wikimedia.org/r/305467 (owner: 10Gehel) [09:06:51] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:07:50] hmm, Phab IRC bot is down? [09:16:33] gehel: thanks for the fix on psycopg2.. I was too quick to merge [09:16:42] akosiaris: no problem... 
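On the slapd disk-space concern above (a 541 MB audit log plus ~1221 BDB transaction logs of ~10 MB each), a quick tally script along these lines can confirm how much space the logs actually hold; the paths and glob patterns are assumptions, not the real layout on seaborgium/serpens:

```python
# Rough disk-usage tally for the slapd/BDB log files mentioned above.
# Paths are illustrative assumptions.
import glob
import os

def total_size(pattern):
    return sum(os.path.getsize(p) for p in glob.glob(pattern))

bdb_logs = total_size("/var/lib/ldap/log.*")          # BDB transaction logs (log.0000xxxx)
audit_log = total_size("/var/log/slapd-audit.log*")   # audit/debug log

print("BDB transaction logs: %.1f GiB" % (bdb_logs / 1024.0 ** 3))
print("slapd audit log:      %.1f MiB" % (audit_log / 1024.0 ** 2))
# With the figures quoted above (1221 files x ~10 MiB) this comes to roughly 12 GiB.
```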
[09:17:06] akosiaris: thanks to you for the monitoring! I now see problem with maps1003... [09:17:58] akosiaris: still trying to figure out what this is. Check is green, but returns "Rep Delay is: None Second". Quick check to the logs indicate there is a replication issue. [09:18:04] I'll dig into it... [09:22:46] gehel: you are running it on a master [09:23:03] at least that's the error you are going to get if running it on a master [09:23:08] could it be that ? [09:23:32] as in the slave was never properly initialized [09:23:40] * akosiaris damn postgres replication... sucks [09:24:00] akosiaris: everything is possible :P But I don't think that's the case [09:24:15] the postgres logs show "invalid record length at 14E/87000130", so it seems there is an issue with replication [09:24:43] akosiaris: so yes, the slave might have never been properly initialized. [09:27:41] (03PS1) 10Gehel: Graphite hourly schema [puppet] - 10https://gerrit.wikimedia.org/r/305470 [09:28:39] (03PS3) 10Filippo Giunchedi: Remove statsdlb, unreferenced now [puppet] - 10https://gerrit.wikimedia.org/r/282357 (owner: 10Faidon Liambotis) [09:29:13] (03PS2) 10Gehel: Graphite hourly / daily schema [puppet] - 10https://gerrit.wikimedia.org/r/305470 [09:30:00] (03CR) 10Filippo Giunchedi: [C: 032] Remove statsdlb, unreferenced now [puppet] - 10https://gerrit.wikimedia.org/r/282357 (owner: 10Faidon Liambotis) [09:38:17] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, also what Moritz said" [puppet] - 10https://gerrit.wikimedia.org/r/305277 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [09:39:57] (03PS1) 10Gehel: Maps - increase Posgresql max_wal_sender [puppet] - 10https://gerrit.wikimedia.org/r/305471 [09:44:50] (03PS2) 10Ema: varnishmedia: remove dead code paths [puppet] - 10https://gerrit.wikimedia.org/r/305287 [09:45:00] (03CR) 10Ema: [C: 032 V: 032] varnishmedia: remove dead code paths [puppet] - 10https://gerrit.wikimedia.org/r/305287 (owner: 10Ema) [10:00:17] (03CR) 10Mobrovac: [C: 04-1] "LGTM overall, some comments in-lined." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/305414 (owner: 10Ppchelko) [10:01:13] (03CR) 10Alexandros Kosiaris: "feel free to test this, but since this is the only override of the default value I 'd say let's just increase the default value." [puppet] - 10https://gerrit.wikimedia.org/r/305471 (owner: 10Gehel) [10:03:42] (03CR) 10Faidon Liambotis: [C: 04-1] "Awesome work. 
Thank you so much for working on this, rewriting it properly and cleaning up my & upstream's messy code :) A few comments in" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/253619 (https://phabricator.wikimedia.org/T99226) (owner: 10Faidon Liambotis) [10:20:33] !log upgrading remaining canary application servers to hhvm 3.12.7 [10:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:21:22] PROBLEM - puppet last run on bast3001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [10:22:01] (03PS4) 10Filippo Giunchedi: hieradata: add thumbor swift account [puppet] - 10https://gerrit.wikimedia.org/r/305275 (https://phabricator.wikimedia.org/T139606) [10:23:22] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:30:12] (03PS1) 10Filippo Giunchedi: thumbor: add instance name to syslog lines [puppet] - 10https://gerrit.wikimedia.org/r/305474 [10:32:28] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: add instance name to syslog lines [puppet] - 10https://gerrit.wikimedia.org/r/305474 (owner: 10Filippo Giunchedi) [10:47:05] (03PS1) 10Filippo Giunchedi: thumbor: get realservers_ip from hiera [puppet] - 10https://gerrit.wikimedia.org/r/305475 [10:51:32] !log cr1-eqiad: deprioritizing all groups (priority=50) [10:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:52:38] (03PS2) 10Filippo Giunchedi: thumbor: get realservers_ip from hiera [puppet] - 10https://gerrit.wikimedia.org/r/305475 [10:55:54] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: get realservers_ip from hiera [puppet] - 10https://gerrit.wikimedia.org/r/305475 (owner: 10Filippo Giunchedi) [10:58:02] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [11:00:05] paravoid: Dear anthropoid, the time has come. Please deploy network maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160818T1100). 
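Several of the graphite1001 alerts in this log ("NN% of data above the critical threshold [50.0]") follow the same pattern: fetch a short window of a metric from Graphite's render API and alert when too many datapoints exceed a threshold. A rough sketch of that logic, with an assumed metric name, URL and percentages rather than the real check definition:

```python
# Sketch of a "percentage of datapoints above threshold" check in the spirit of
# the graphite1001 alerts in this log. Metric name and thresholds are assumptions.
import json
from urllib.request import urlopen

GRAPHITE = ("http://graphite1001/render"
            "?target=MediaWiki.errors.fatal.rate&from=-10min&format=json")
THRESHOLD = 50.0      # critical threshold [50.0]
CRIT_PERCENT = 30.0   # alert if more than this share of points breach it

def breach_percentage(url, threshold):
    series = json.load(urlopen(url))[0]["datapoints"]   # [[value, timestamp], ...]
    values = [v for v, _ts in series if v is not None]
    if not values:
        return 0.0
    above = sum(1 for v in values if v > threshold)
    return 100.0 * above / len(values)

pct = breach_percentage(GRAPHITE, THRESHOLD)
if pct >= CRIT_PERCENT:
    print("CRITICAL: %.2f%% of data above the critical threshold [%s]" % (pct, THRESHOLD))
else:
    print("OK: Less than %.2f%% above the threshold [%s]" % (CRIT_PERCENT, THRESHOLD))
```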
[11:00:13] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [11:01:45] (03CR) 10Filippo Giunchedi: [C: 031] "I'm ok with it in general, note though that suffixes ATM are also used for statsd derived metrics (max/min/etc) so it might create some co" [puppet] - 10https://gerrit.wikimedia.org/r/305470 (owner: 10Gehel) [11:04:20] !log cr1-eqiad: disabling xe-4/2/0 (link to cr1-codfw) and xe-4/2/2 (link to cr2-knams) [11:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:06:41] !log cr1-eqiad: deactivating BGP sessions with PyBal/LVS [11:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:31:12] !log uploaded new Linux package for jessie-wikimedia to carbon (now based on 4.4.18) [11:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:36:15] (03PS1) 10Alexandros Kosiaris: check_puppetrun: Display the admin set message [puppet] - 10https://gerrit.wikimedia.org/r/305482 [11:36:17] (03PS1) 10Alexandros Kosiaris: check_puppetrun: Move the failure checks at the top [puppet] - 10https://gerrit.wikimedia.org/r/305483 [11:36:19] (03PS1) 10Alexandros Kosiaris: check_puppetrun: Remove statefile usage [puppet] - 10https://gerrit.wikimedia.org/r/305484 [11:36:21] (03PS1) 10Alexandros Kosiaris: check_puppetrun: Remove unused lastrun_failed var [puppet] - 10https://gerrit.wikimedia.org/r/305485 [11:36:23] (03PS1) 10Alexandros Kosiaris: check_puppetrun: Remove old failure handling code [puppet] - 10https://gerrit.wikimedia.org/r/305486 [11:36:25] (03PS1) 10Alexandros Kosiaris: check_puppetrun: Add reportfile handling [puppet] - 10https://gerrit.wikimedia.org/r/305487 [11:36:33] (03PS24) 10Alexandros Kosiaris: Logstash: Enable log4j provider [puppet] - 10https://gerrit.wikimedia.org/r/302601 (https://phabricator.wikimedia.org/T141324) (owner: 10Chad) [11:36:40] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Logstash: Enable log4j provider [puppet] - 10https://gerrit.wikimedia.org/r/302601 (https://phabricator.wikimedia.org/T141324) (owner: 10Chad) [11:42:55] (03CR) 10jenkins-bot: [V: 04-1] check_puppetrun: Add reportfile handling [puppet] - 10https://gerrit.wikimedia.org/r/305487 (owner: 10Alexandros Kosiaris) [11:43:35] grrr, damn rubocop [11:44:31] (03PS3) 10Gehel: Graphite hourly / daily schema [puppet] - 10https://gerrit.wikimedia.org/r/305470 [11:44:40] (03PS2) 10Alexandros Kosiaris: check_puppetrun: Add reportfile handling [puppet] - 10https://gerrit.wikimedia.org/r/305487 [11:45:59] (03CR) 10Gehel: [C: 032] Graphite hourly / daily schema [puppet] - 10https://gerrit.wikimedia.org/r/305470 (owner: 10Gehel) [11:47:39] (03PS2) 10Gehel: Maps - increase Posgresql max_wal_sender [puppet] - 10https://gerrit.wikimedia.org/r/305471 [11:47:39] !log cr1-eqiad: deactivating Transit4/6 and Private-Peer4/6 BGP sessions [11:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:49:05] akosiaris: not any major stuff please, in the middle of the cr1-eqiad upgrade [11:49:12] not sure if your commit qualifies [11:49:14] gehel: too :) [11:49:32] paravoid: I am aware [11:49:39] * akosiaris crosses fingers [11:49:56] * ema too [11:54:02] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 40 probes of 244 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:54:46] !log cr1-eqiad: deactivating PyBal/LVS backup static routes [11:54:50] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:56:07] !log cr1-eqiad: deactivating Fundraising BGP session [11:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:56:49] paravoid: ack [11:57:21] akosiaris: you have a minute to chat about postgresql replication? [12:00:11] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 12 probes of 244 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:01:23] cr1-eqiad: disabling Transit/Fundraising interfaces [12:01:25] er [12:01:27] !log cr1-eqiad: disabling Transit/Fundraising interfaces [12:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:04:35] cr1-eqiad: disabling all asw row A interfaces [12:04:37] !log cr1-eqiad: disabling all asw row A interfaces [12:04:39] damn :) [12:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:05:42] !log cr1-eqiad: disabling all asw row B/C interfaces [12:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:06:33] PROBLEM - Host cr1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:07:03] PROBLEM - Apache HTTP on mw1257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:07:11] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 107, down: 1, dormant: 0, excluded: 2, unused: 0BRxe-6/0/0: down - Core: cr1-eqiad:xe-5/0/3BR [12:08:32] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:09:01] RECOVERY - Apache HTTP on mw1257 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 3.035 second response time [12:09:01] PROBLEM - puppet last run on db1091 is CRITICAL: Timeout while attempting connection [12:09:11] PROBLEM - ElasticSearch health check for shards on elastic1047 is CRITICAL: CRITICAL - elasticsearch inactive shards 2652 threshold =0.1% breach: status: red, number_of_nodes: 24, unassigned_shards: 2582, number_of_pending_tasks: 359, number_of_in_flight_fetch: 25795, timed_out: False, active_primary_shards: 2997, task_max_waiting_in_queue_millis: 255953, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [12:09:22] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:09:22] PROBLEM - ElasticSearch health check for shards on elastic1020 is CRITICAL: CRITICAL - elasticsearch inactive shards 2652 threshold =0.1% breach: status: red, number_of_nodes: 24, unassigned_shards: 2582, number_of_pending_tasks: 370, number_of_in_flight_fetch: 25749, timed_out: False, active_primary_shards: 2997, task_max_waiting_in_queue_millis: 273355, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [12:09:26] things are quite slow [12:09:31] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:09:32] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:09:32] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:09:41] PROBLEM - DPKG on cp1074 is CRITICAL: Timeout while attempting connection [12:09:44] Hey guys? Did someone just break login? [12:09:51] paravoid: ^ ? [12:09:51] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:09:52] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:09:53] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:09:56] PROBLEM - YARN NodeManager Node-State on analytics1036 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:09:56] I suppose related ? [12:10:00] PROBLEM - HTTPS on cp1074 is CRITICAL: Return code of 110 is out of bounds [12:10:21] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:10:21] PROBLEM - Host cr1-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [12:10:32] PROBLEM - ElasticSearch health check for shards on elastic1021 is CRITICAL: CRITICAL - elasticsearch inactive shards 3455 threshold =0.1% breach: status: red, number_of_nodes: 21, unassigned_shards: 3389, number_of_pending_tasks: 592, number_of_in_flight_fetch: 22303, timed_out: False, active_primary_shards: 2892, task_max_waiting_in_queue_millis: 337090, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [12:10:32] PROBLEM - ElasticSearch health check for shards on elastic1033 is CRITICAL: CRITICAL - elasticsearch inactive shards 3758 threshold =0.1% breach: status: red, number_of_nodes: 19, unassigned_shards: 3698, number_of_pending_tasks: 650, number_of_in_flight_fetch: 22197, timed_out: False, active_primary_shards: 2862, task_max_waiting_in_queue_millis: 339576, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [12:10:32] PROBLEM - ElasticSearch health check for shards on elastic1042 is CRITICAL: CRITICAL - elasticsearch inactive shards 3758 threshold =0.1% breach: status: red, number_of_nodes: 19, unassigned_shards: 3698, number_of_pending_tasks: 650, number_of_in_flight_fetch: 22743, timed_out: False, active_primary_shards: 2862, task_max_waiting_in_queue_millis: 339834, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [12:10:33] PROBLEM - ElasticSearch health check for shards on elastic1025 is CRITICAL: CRITICAL - elasticsearch inactive shards 4064 threshold =0.1% breach: status: red, number_of_nodes: 18, unassigned_shards: 4007, number_of_pending_tasks: 759, number_of_in_flight_fetch: 12292, timed_out: False, active_primary_shards: 2809, task_max_waiting_in_queue_millis: 343940, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [12:10:42] that's ES breaking again? 
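The ElasticSearch alerts above and below report "inactive shards ... threshold =0.1% breach" based on figures from the cluster health API. An illustrative reimplementation of that kind of check, where the exact definition of "inactive" and the output format are guesses, not the actual plugin:

```python
# Illustrative version of the "elasticsearch inactive shards" check seen in this log.
# URL, threshold handling and the notion of "inactive" are assumptions.
import json
import sys
from urllib.request import urlopen

THRESHOLD = 0.001  # 0.1% of shards may be inactive before alerting

def check_shards(host="localhost", port=9200):
    health = json.load(urlopen(f"http://{host}:{port}/_cluster/health"))
    inactive = (health["unassigned_shards"]
                + health["initializing_shards"]
                + health["relocating_shards"])
    total = health["active_shards"] + inactive
    if total and inactive / total > THRESHOLD:
        return 2, (f"CRITICAL - elasticsearch inactive shards {inactive} "
                   f"threshold ={THRESHOLD:.1%} breach: {health}")
    return 0, f"OK - elasticsearch status {health['status']}"

code, message = check_shards()
print(message)
sys.exit(code)
```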
[12:10:42] PROBLEM - ElasticSearch health check for shards on elastic1040 is CRITICAL: CRITICAL - elasticsearch inactive shards 4011 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 4007, number_of_pending_tasks: 805, number_of_in_flight_fetch: 9503, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 352341, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:10:42] PROBLEM - ElasticSearch health check for shards on elastic1031 is CRITICAL: CRITICAL - elasticsearch inactive shards 4011 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 4007, number_of_pending_tasks: 805, number_of_in_flight_fetch: 9503, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 352336, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:10:42] PROBLEM - ElasticSearch health check for shards on elastic1035 is CRITICAL: CRITICAL - elasticsearch inactive shards 4011 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 4007, number_of_pending_tasks: 805, number_of_in_flight_fetch: 9503, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 352393, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:10:42] PROBLEM - ElasticSearch health check for shards on elastic1039 is CRITICAL: CRITICAL - elasticsearch inactive shards 4011 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 4007, number_of_pending_tasks: 806, number_of_in_flight_fetch: 9195, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 352556, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:10:51] PROBLEM - ElasticSearch health check for shards on elastic1027 is CRITICAL: CRITICAL - elasticsearch inactive shards 4011 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 4007, number_of_pending_tasks: 836, number_of_in_flight_fetch: 8859, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 361194, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:10:52] PROBLEM - ElasticSearch health check for shards on elastic1023 is CRITICAL: CRITICAL - elasticsearch inactive shards 4011 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 4007, number_of_pending_tasks: 839, number_of_in_flight_fetch: 8859, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 361873, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:10:52] PROBLEM - ElasticSearch health check for shards on elastic1046 is CRITICAL: CRITICAL - elasticsearch inactive shards 4011 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 4007, number_of_pending_tasks: 839, number_of_in_flight_fetch: 8859, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 361914, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:10:52] PROBLEM - ElasticSearch health check for shards on elastic1022 is CRITICAL: CRITICAL - elasticsearch inactive shards 4011 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 4007, number_of_pending_tasks: 839, number_of_in_flight_fetch: 8859, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 361934, cluster_name: 
production-search-eqiad, relocating_shards: 0, active_shards_percen [12:10:52] PROBLEM - ElasticSearch health check for shards on elastic1037 is CRITICAL: CRITICAL - elasticsearch inactive shards 4011 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 4007, number_of_pending_tasks: 839, number_of_in_flight_fetch: 8859, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 361934, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:10:53] PROBLEM - ElasticSearch health check for shards on elastic1026 is CRITICAL: CRITICAL - elasticsearch inactive shards 4011 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 4007, number_of_pending_tasks: 841, number_of_in_flight_fetch: 8859, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 362086, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:10:53] PROBLEM - ElasticSearch health check for shards on elastic1044 is CRITICAL: CRITICAL - elasticsearch inactive shards 4011 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 4007, number_of_pending_tasks: 841, number_of_in_flight_fetch: 8859, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 362105, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:10:54] PROBLEM - ElasticSearch health check for shards on elastic1017 is CRITICAL: CRITICAL - elasticsearch inactive shards 4011 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 4007, number_of_pending_tasks: 841, number_of_in_flight_fetch: 8859, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 362151, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:10:54] PROBLEM - ElasticSearch health check for shards on elastic1024 is CRITICAL: CRITICAL - elasticsearch inactive shards 4011 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 4007, number_of_pending_tasks: 842, number_of_in_flight_fetch: 8859, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 362785, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:11:03] RECOVERY - puppet last run on db1091 is OK: OK: Puppet is currently enabled, last run 12 minutes ago with 0 failures [12:11:13] PROBLEM - ElasticSearch health check for shards on elastic1041 is CRITICAL: CRITICAL - elasticsearch inactive shards 4010 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 3916, number_of_pending_tasks: 914, number_of_in_flight_fetch: 10983, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 384398, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [12:11:13] gehel: ^^^ [12:11:21] PROBLEM - ElasticSearch health check for shards on elastic1029 is CRITICAL: CRITICAL - elasticsearch inactive shards 4010 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 3916, number_of_pending_tasks: 921, number_of_in_flight_fetch: 10765, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 387232, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [12:11:21] PROBLEM - ElasticSearch health check for shards on elastic1019 is CRITICAL: CRITICAL - elasticsearch inactive shards 4010 threshold =0.1% breach: status: red, number_of_nodes: 
28, unassigned_shards: 3916, number_of_pending_tasks: 924, number_of_in_flight_fetch: 10765, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 387918, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [12:11:21] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:21] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:21] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:22] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:31] Commons is working, enwiki tis not. [12:11:32] PROBLEM - ElasticSearch health check for shards on elastic1032 is CRITICAL: CRITICAL - elasticsearch inactive shards 4010 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 3916, number_of_pending_tasks: 959, number_of_in_flight_fetch: 9819, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 402854, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:11:32] PROBLEM - ElasticSearch health check for shards on elastic1038 is CRITICAL: CRITICAL - elasticsearch inactive shards 4010 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 3916, number_of_pending_tasks: 959, number_of_in_flight_fetch: 9819, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 402894, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:11:32] PROBLEM - ElasticSearch health check for shards on elastic1036 is CRITICAL: CRITICAL - elasticsearch inactive shards 4010 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 3916, number_of_pending_tasks: 959, number_of_in_flight_fetch: 9819, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 403033, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:11:32] PROBLEM - ElasticSearch health check for shards on elastic1034 is CRITICAL: CRITICAL - elasticsearch inactive shards 4010 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 3916, number_of_pending_tasks: 959, number_of_in_flight_fetch: 9819, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 403064, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:11:32] PROBLEM - ElasticSearch health check for shards on elastic1030 is CRITICAL: CRITICAL - elasticsearch inactive shards 4010 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 3916, number_of_pending_tasks: 959, number_of_in_flight_fetch: 9819, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 403197, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:11:33] PROBLEM - ElasticSearch health check for shards on elastic1045 is CRITICAL: CRITICAL - elasticsearch inactive shards 4010 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 3916, number_of_pending_tasks: 959, number_of_in_flight_fetch: 9819, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 403253, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:11:33] PROBLEM - ElasticSearch health check for 
shards on elastic1018 is CRITICAL: CRITICAL - elasticsearch inactive shards 4010 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 3916, number_of_pending_tasks: 959, number_of_in_flight_fetch: 9819, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 403284, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:11:33] PROBLEM - ElasticSearch health check for shards on elastic1028 is CRITICAL: CRITICAL - elasticsearch inactive shards 4010 threshold =0.1% breach: status: red, number_of_nodes: 28, unassigned_shards: 3916, number_of_pending_tasks: 959, number_of_in_flight_fetch: 9819, timed_out: False, active_primary_shards: 2812, task_max_waiting_in_queue_millis: 403315, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percen [12:11:42] PROBLEM - Hadoop HDFS Zookeeper failover controller on analytics1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.tools.DFSZKFailoverController [12:11:45] PROBLEM - YARN NodeManager Node-State on analytics1056 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:48] PROBLEM - YARN NodeManager Node-State on analytics1049 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:51] PROBLEM - YARN NodeManager Node-State on analytics1048 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:55] RECOVERY - DPKG on cp1074 is OK: All packages OK [12:11:56] PROBLEM - YARN NodeManager Node-State on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:58] PROBLEM - YARN NodeManager Node-State on analytics1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:12:01] PROBLEM - YARN NodeManager Node-State on analytics1029 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:12:04] PROBLEM - YARN NodeManager Node-State on analytics1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:12:07] PROBLEM - YARN NodeManager Node-State on analytics1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:12:11] PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:12:15] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [12:12:15] PROBLEM - traffic-pool service on cp1071 is CRITICAL: Timeout while attempting connection [12:12:25] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [12:12:34] RECOVERY - HTTPS on cp1074 is OK: SSLXNN OK - 36 OK [12:12:34] PROBLEM - PyBal backends health check on lvs1005 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1002.eqiad.wmnet because of too many down! 
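The lvs1005 message just above ("Could not depool server ocg1002.eqiad.wmnet because of too many down!") is PyBal's depool protection: it refuses to depool a backend when too few healthy servers would remain pooled. A toy model of that behaviour, with an invented API and threshold rather than PyBal's real implementation:

```python
# Sketch of PyBal-style depool protection. Names and numbers are illustrative.
class Pool:
    def __init__(self, servers, depool_threshold=0.5):
        # depool_threshold: minimum fraction of servers that must stay pooled
        self.servers = dict.fromkeys(servers, True)  # name -> pooled?
        self.depool_threshold = depool_threshold

    def can_depool(self):
        pooled = sum(self.servers.values())
        return (pooled - 1) / len(self.servers) >= self.depool_threshold

    def depool(self, name):
        if not self.can_depool():
            raise RuntimeError(
                f"Could not depool server {name} because of too many down!")
        self.servers[name] = False

pool = Pool(["ocg1001.eqiad.wmnet", "ocg1002.eqiad.wmnet"], depool_threshold=0.5)
pool.depool("ocg1001.eqiad.wmnet")   # allowed: one of two servers stays pooled
pool.depool("ocg1002.eqiad.wmnet")   # raises: would leave the pool empty
```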
[12:12:34] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [12:12:39] PROBLEM - ElasticSearch health check for shards on elastic1043 is CRITICAL: CRITICAL - elasticsearch inactive shards 4633 threshold =0.1% breach: status: red, number_of_nodes: 24, unassigned_shards: 4535, number_of_pending_tasks: 1365, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 2646, task_max_waiting_in_queue_millis: 467509, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_ [12:12:44] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [12:12:55] PROBLEM - puppet last run on mc1015 is CRITICAL: CRITICAL: Puppet has 2 failures [12:13:14] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] [12:13:15] RECOVERY - Host cr1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.64 ms [12:13:15] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [12:13:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [12:13:45] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [12:13:45] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [12:13:45] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [12:13:49] !log cr1-eqiad: reenabling all asw row A/B/C interfaces [12:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:13:55] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [12:14:05] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [12:14:06] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [12:14:06] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [12:14:06] RECOVERY - YARN NodeManager Node-State on analytics1049 is OK: OK: YARN NodeManager analytics1049.eqiad.wmnet:8041 Node-State: RUNNING [12:14:09] RECOVERY - YARN NodeManager Node-State on analytics1048 is OK: OK: YARN NodeManager analytics1048.eqiad.wmnet:8041 Node-State: RUNNING [12:14:14] RECOVERY - YARN NodeManager Node-State on analytics1056 is OK: OK: YARN NodeManager analytics1056.eqiad.wmnet:8041 Node-State: RUNNING [12:14:17] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [12:14:17] Revent: network maintenance in progress [12:14:24] RECOVERY - YARN NodeManager Node-State on analytics1053 is OK: OK: YARN NodeManager analytics1053.eqiad.wmnet:8041 Node-State: RUNNING [12:14:27] RECOVERY - YARN NodeManager Node-State on analytics1052 is OK: OK: YARN NodeManager analytics1052.eqiad.wmnet:8041 Node-State: RUNNING [12:14:30] RECOVERY - YARN NodeManager Node-State on analytics1046 is OK: OK: YARN NodeManager analytics1046.eqiad.wmnet:8041 Node-State: RUNNING [12:14:33] RECOVERY - YARN NodeManager Node-State on analytics1029 is OK: OK: YARN NodeManager analytics1029.eqiad.wmnet:8041 Node-State: RUNNING [12:14:36] RECOVERY - YARN NodeManager Node-State on analytics1050 is OK: OK: YARN NodeManager analytics1050.eqiad.wmnet:8041 Node-State: RUNNING [12:14:40] RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: 
YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING [12:14:43] RECOVERY - traffic-pool service on cp1071 is OK: OK - traffic-pool is active [12:14:43] PROBLEM - puppet last run on es1018 is CRITICAL: CRITICAL: puppet fail [12:14:48] oh ok [12:15:05] RECOVERY - PyBal backends health check on lvs1005 is OK: PYBAL OK - All pools are healthy [12:15:05] RECOVERY - YARN NodeManager Node-State on analytics1036 is OK: OK: YARN NodeManager analytics1036.eqiad.wmnet:8041 Node-State: RUNNING [12:15:15] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [1000.0] [12:15:24] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [12:15:35] PROBLEM - Kafka Broker Server on kafka1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties [12:15:56] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [12:15:56] PROBLEM - puppet last run on cp1074 is CRITICAL: CRITICAL: Puppet has 1 failures [12:15:59] paravoid: checking... [12:16:07] gehel: thanks (cc: ema) [12:16:10] <_joe_> restbase went down? [12:16:26] <_joe_> let me get back to my desk [12:16:27] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Puppet has 1 failures [12:16:38] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] [12:17:07] RECOVERY - Host cr1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.78 ms [12:17:12] also dbproxy1010, but I think that is passive [12:17:56] RECOVERY - Kafka Broker Server on kafka1020 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties [12:19:27] RECOVERY - Hadoop HDFS Zookeeper failover controller on analytics1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.tools.DFSZKFailoverController [12:19:32] hi [12:19:36] I get Can't connect to MySQL server on '10.64.0.205' (4) (10.64.0.205)) [12:19:47] yannf, doing what? [12:19:48] https://fr.wikisource.org/w/index.php?title=Livre:Reclus_-_La_Commune_de_Paris_au_jour_le_jour.djvu&action=purge [12:20:20] it works now, but... [12:20:27] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [12:20:35] it asks for confirmation [12:20:45] which it didn't before [12:20:49] yannf, we just had a network glitch (it may be related or not, we do not know yet) [12:20:59] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:21:28] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy [12:21:28] we are investigating [12:21:38] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy [12:21:39] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [12:21:48] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy [12:22:08] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:23:17] we definitely had 30K queries failing between 12:04 and 12:12 [12:23:53] (03PS1) 10Gehel: Switching search traffic to codfw as eqiad seems unstable. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/305490 [12:23:58] that's useful jynus, thanks [12:24:13] top host pc1006.eqiad.wmnet [12:24:18] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [12:24:24] queries doesn't necessarily mean requests [12:24:24] (03CR) 10jenkins-bot: [V: 04-1] Switching search traffic to codfw as eqiad seems unstable. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305490 (owner: 10Gehel) [12:24:33] some of those I think are retried [12:24:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:24:41] probably most [12:25:22] (03PS2) 10Gehel: Switching search traffic to codfw as eqiad seems unstable. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305490 [12:25:28] then db1065 and db1066 [12:25:38] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1001 is OK: OK: Less than 20.00% above the threshold [300.0] [12:26:40] (03CR) 10Gehel: [C: 032] Switching search traffic to codfw as eqiad seems unstable. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305490 (owner: 10Gehel) [12:27:48] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:31:26] (03CR) 10Alexandros Kosiaris: [C: 031] Maps - increase Posgresql max_wal_sender [puppet] - 10https://gerrit.wikimedia.org/r/305471 (owner: 10Gehel) [12:31:33] (03CR) 10BBlack: "CN still has backwards-compat code for the old session cookies (before Region was added) from a year ago :)." [puppet] - 10https://gerrit.wikimedia.org/r/305419 (https://phabricator.wikimedia.org/T99226) (owner: 10BBlack) [12:36:08] RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [12:36:17] (03CR) 10Faidon Liambotis: "My point with the people that keep browsers open for a long time was that a) they'll be a minority (esp. the overlap with IPv6-enabled use" [puppet] - 10https://gerrit.wikimedia.org/r/305419 (https://phabricator.wikimedia.org/T99226) (owner: 10BBlack) [12:37:19] RECOVERY - puppet last run on mc1015 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [12:39:48] RECOVERY - puppet last run on cp1074 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [12:40:15] !log switching search traffic to codfw [12:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:40:21] yannf, so there was definitely a network glitch that made some queries fail [12:40:58] RECOVERY - puppet last run on es1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:40:59] !log gehel@tin Synchronized wmf-config/InitialiseSettings.php: switching search to codfw (duration: 00m 56s) [12:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:41:18] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-urd-hin] - 10https://gerrit.wikimedia.org/r/296368 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [12:43:58] !log cr1-eqiad: disabling all asw row D interfaces [12:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:44:40] If it is at all helpful, Commons is still working normally. 
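Earlier in this window jynus quoted roughly 30K failed queries between 12:04 and 12:12, with pc1006 as the top host (and noted that failed queries are often retried, so they don't map one-to-one to failed requests). A rough way to produce that kind of tally from an error log is sketched below; the log path and line format are assumptions for illustration:

```python
# Count "Can't connect to MySQL server" style errors per backend host in a window.
# Log path and line format are assumptions, not the real MediaWiki error stream.
import re
from collections import Counter

WINDOW = ("2016-08-18 12:04", "2016-08-18 12:12")
PATTERN = re.compile(r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}).*"
                     r"Can't connect to MySQL server on '(?P<host>[\d.]+)'")

counts = Counter()
with open("dberror.log") as log:
    for line in log:
        m = PATTERN.search(line)
        if m and WINDOW[0] <= m.group("ts") <= WINDOW[1]:
            counts[m.group("host")] += 1

print("total failed queries:", sum(counts.values()))
for host, n in counts.most_common(5):
    print(host, n)
```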
[12:44:48] thanks Revent [12:45:14] [V7WtjQpAADwAAZV-nOoAAAAN] 2016-08-18 12:44:01: Fatal exception of type "MWException" <- when attempting to login to enwiki [12:45:47] Er… commons was working, it just logged me out. [12:45:56] now? [12:46:01] ffs [12:46:11] Ok… false alarm. [12:46:13] did it recover? [12:46:18] it wasn't false, it was just very brief [12:46:21] <_joe_> maybe his session is stored in row D [12:46:38] A failed login attempt to enwiki logged me out of Commons. I was able to log back in. [12:46:51] <_joe_> Revent: you had to log back in? [12:47:13] <_joe_> yeah, logging out clears the cookies [12:47:13] To Commons just now, but I kinda suspect enwiki removed my cookie. [12:47:34] <_joe_> so of course you had to relogin upon a failure to communicate with the backend [12:47:54] you get globally logged out because of a temporary minor hiccup when retrieving your session? [12:47:57] that's just broken [12:48:26] <_joe_> I don't disagree [12:48:35] <_joe_> I'm just trying to make sense of what we're seeing [12:48:37] As long as I don’t try to actually ‘login’ to enwiki, I stay logged in to Commons (and can still ‘log in’) and see a logged out enwiki [12:48:39] ACKNOWLEDGEMENT - ElasticSearch health check for shards on elastic1017 is CRITICAL: CRITICAL - elasticsearch inactive shards 2007 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1917, number_of_pending_tasks: 12, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 1849, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_p [12:48:40] ACKNOWLEDGEMENT - ElasticSearch health check for shards on elastic1018 is CRITICAL: CRITICAL - elasticsearch inactive shards 2021 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1931, number_of_pending_tasks: 52, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 4435, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_p [12:48:40] ACKNOWLEDGEMENT - ElasticSearch health check for shards on elastic1019 is CRITICAL: CRITICAL - elasticsearch inactive shards 2022 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1932, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [12:48:40] ACKNOWLEDGEMENT - ElasticSearch health check for shards on elastic1020 is CRITICAL: CRITICAL - elasticsearch inactive shards 2021 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1931, number_of_pending_tasks: 59, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 5179, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_p [12:48:40] ACKNOWLEDGEMENT - ElasticSearch health check for shards on elastic1021 is CRITICAL: CRITICAL - elasticsearch inactive shards 2013 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1923, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [12:48:40] ACKNOWLEDGEMENT - ElasticSearch health check for shards on elastic1022 is CRITICAL: CRITICAL - elasticsearch 
inactive shards 2007 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1917, number_of_pending_tasks: 15, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 2127, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_p [12:48:45] yeah I know [12:48:52] what was that? gehel, that you? [12:48:57] <_joe_> I guess so [12:49:09] paravoid: yes, acknowledging the issue [12:49:14] they weren't at PROBLEM, before though? [12:49:17] er [12:49:19] CRITICAL I mean [12:49:35] paravoid: they were [12:49:37] they were critical [12:49:46] they weren't in my icinga view 5 minutes ago [12:49:57] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 187, down: 4, dormant: 0, excluded: 0, unused: 0BRae4.32767: down - BRae4.1004: down - Subnet public1-d-eqiadBRae4.1023: down - Subnet analytics1-d-eqiadBRae4.1020: down - Subnet private1-d-eqiadBR [12:50:08] otrs-wiki is also apparently ok. [12:50:09] jynus: what's with the dbproxy1010 haproxy alert? [12:50:24] it means that it detected the primary master as down [12:50:32] (the cr1-eqiad alert is to be expected, no point in ack'ing it because it'll change soon) [12:50:34] and is sending data to the slave [12:50:44] which one is down? [12:51:02] oh, db is not down anymore, but we do not un-failover automatically [12:51:11] oh ok [12:51:11] just alert and let the operatos handle it [12:51:27] I'll proceed with the rest then [12:51:28] it prevents flopping, I think it is a good alternative [12:52:08] but ignore, buecase by change the proxy was not in use (remember I overdo the redundancy of the proxies) [12:52:18] let's move on [12:52:29] yes, I will fix that in a second [12:52:31] !log cr1-eqiad: disabling all asw row A-C interfaces [12:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:53:05] ok [12:53:13] all cr1 interfaces besides the one to cr2 are down [12:53:50] !log cr1-eqiad: deactivate chassis redundancy graceful-switchover [12:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:56:44] !log cr2-eqiad: remove VRRPv3 backwards compatibility (delete protocols vrrp checksum-without-pseudoheader) [12:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:57:12] ok [12:57:14] proceeding with the upgrade [12:57:19] halfway into our window [12:57:25] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-hbs] - 10https://gerrit.wikimedia.org/r/294675 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [12:58:32] (03PS1) 10KartikMistry: Revert "apertium-hbs: Rebuild for Jessie and other fixes" [debs/contenttranslation/apertium-hbs] - 10https://gerrit.wikimedia.org/r/305495 [12:58:48] (03CR) 10jenkins-bot: [V: 04-1] Revert "apertium-hbs: Rebuild for Jessie and other fixes" [debs/contenttranslation/apertium-hbs] - 10https://gerrit.wikimedia.org/r/305495 (owner: 10KartikMistry) [12:58:49] !log cr1-eqiad: upgrading re0 and rebooting [12:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:59:25] akosiaris: around? [12:59:49] akosiaris: can you merge, https://gerrit.wikimedia.org/r/#/c/305495/ ? It seems, https://gerrit.wikimedia.org/r/#/c/294675/ messed up repo. [12:59:59] or whatever is the best. 
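The dbproxy1010 "haproxy failover ... servers up 1 down 1" alert discussed above comes from inspecting haproxy's backend state: the proxy fails over to the slave automatically but is deliberately left for an operator to fail back (to avoid flapping), which is what the reload just below does. A sketch of such a state check via haproxy's stats socket, with an assumed socket path and output format:

```python
# Sketch of a dbproxy-style haproxy check: report how many backend servers are
# up/down so an operator can fail back manually. Socket path is an assumption.
import csv
import socket

def haproxy_stats(sock_path="/run/haproxy/haproxy.sock"):
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(sock_path)
    s.sendall(b"show stat\n")
    data = b""
    while chunk := s.recv(4096):
        data += chunk
    s.close()
    rows = csv.DictReader(data.decode().lstrip("# ").splitlines())
    return [r for r in rows if r["svname"] not in ("FRONTEND", "BACKEND")]

servers = haproxy_stats()
up = sum(1 for r in servers if r["status"].startswith("UP"))
down = len(servers) - up
prefix = "CRITICAL" if down else "OK"
print(f"{prefix} check_failover servers up {up} down {down}")
```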
[13:00:02] !log reloaded dbproxy1010's haproxy service to point to the original master [13:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:00:38] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures [13:00:59] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Puppet has 1 failures [13:01:08] PROBLEM - puppet last run on ms-be2022 is CRITICAL: CRITICAL: Puppet has 1 failures [13:01:15] kart_: argh, how was that even pushed ? it had a clear -1 [13:01:28] actually 2 -1s [13:01:36] akosiaris: seems wrong branch. [13:02:03] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Revert "apertium-hbs: Rebuild for Jessie and other fixes" [debs/contenttranslation/apertium-hbs] - 10https://gerrit.wikimedia.org/r/305495 (owner: 10KartikMistry) [13:02:08] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [13:02:09] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures [13:02:14] I imported new upstream in the branch Debian instead of master, that thought Gerrit that change is merged. [13:02:27] we really need a puppet check that checks for catalog failures, not just random puppetmaster disconnects [13:02:40] akosiaris: check if it is really merged or not. [13:02:47] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:02:48] maybe with a puppet report handler [13:02:56] ema: you were dealing with that last, right? :) [13:03:08] paravoid: partly already done in https://gerrit.wikimedia.org/r/#/c/305487/ [13:03:26] cool! [13:03:28] reports catalog failures at least [13:03:28] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:03:29] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:03:32] paravoid: this is how far I went https://gerrit.wikimedia.org/r/#/c/298921/ [13:03:33] what's with the 500s [13:03:34] it still reports CRITICAL [13:03:57] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [13:06:02] those 5xx are universal [13:06:09] e.g. upload/maps affected with the same spike shape [13:06:30] so it's 503 varnish<->varnish regardless of cluster [13:07:10] 503s *between* varnishes? [13:07:22] I confirm do not see anything from mediawiki errors [13:07:54] mark: I don't know to which backend, but the distinction I'm drawing is this isn't because of some failure down in MW/DB/RB [13:08:00] this is network-induced 503 [13:08:48] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [13:08:59] could this be the alarm from before? it is known to have a lot of delay [13:09:29] centers on 12:56, very tiny spike. 
same basic thing as what happened at ~12:00, but the previous spike was slightly fatter in shape [13:09:47] weird [13:09:48] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:09:49] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:09:51] ok, sorry, graphana hasn't reach there [13:09:52] i'm not seeing any interfaces running hot [13:10:18] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] [13:10:18] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:10:24] (03PS1) 10KartikMistry: apertium-hbs: New upstream snapshot and rebuild [debs/contenttranslation/apertium-hbs] - 10https://gerrit.wikimedia.org/r/305498 (https://phabricator.wikimedia.org/T107306) [13:10:37] (03CR) 10jenkins-bot: [V: 04-1] apertium-hbs: New upstream snapshot and rebuild [debs/contenttranslation/apertium-hbs] - 10https://gerrit.wikimedia.org/r/305498 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [13:10:38] akosiaris: resubmitted clean patch. [13:10:46] OK. it still fails. [13:10:58] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:11:17] the 12:56-ish 503's line up with the puppetfails around 13:00 too, those would be from temporary network glitch a few mins before [13:11:35] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-hbs] - 10https://gerrit.wikimedia.org/r/305498 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [13:11:45] either way it's a temporary glitch [13:11:53] but it's not a glitch-free switch for sure [13:12:17] could have been row D arp expiring [13:12:22] as vrrp was broken [13:12:35] (03PS5) 10Filippo Giunchedi: hieradata: add thumbor swift account [puppet] - 10https://gerrit.wikimedia.org/r/305275 (https://phabricator.wikimedia.org/T139606) [13:12:37] not ARP [13:12:43] it's the same MAC due to VRRP [13:12:48] the switch's mac address table maybe [13:13:21] although faidon disabled row D at 12:43 already [13:16:37] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [13:18:33] !log cr1-eqiad: toggling mastership between routing-engines (re1->re0) [13:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:18:44] akosiaris: /tmp/hudson2864184890027879912.sh: line 5: /usr/bin/lintian-junit-report: No such file or directory - what is this? [13:18:56] https://integration.wikimedia.org/ci/job/debian-glue/550/console [13:19:36] (03CR) 10KartikMistry: "13:14:19 + /usr/bin/lintian-junit-report --filename lintian-binary.txt apertium-hbs_0.5.0~r68212-1+wmf1+0~20160818131144.550+jessie+wikime" [debs/contenttranslation/apertium-hbs] - 10https://gerrit.wikimedia.org/r/305498 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [13:20:21] !log cr1-eqiad: upgrading re1 and rebooting [13:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:20:54] kart_: the wrapper jenkins uses to run lintian... not sure why it fails [13:21:19] PROBLEM - Host cr1-eqiad is DOWN: CRITICAL - Network Unreachable (208.80.154.196) [13:21:19] hmm. Let me check with other packages. 
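The earlier aside about wanting a puppet check that catches catalog and resource failures (rather than only puppetmaster disconnects) is essentially what the check_puppetrun patches later in this log ("Add failed resource warning/critical levels") set out to do. A minimal sketch of the idea, assuming the agent summary lives at /var/lib/puppet/state/last_run_summary.yaml and using illustrative thresholds; both the path and the thresholds are assumptions, and PyYAML is required:

    #!/usr/bin/env python3
    """Nagios-style sketch: alert on failed puppet resources, not just stale runs.

    Summary file location and thresholds are assumptions for illustration.
    """
    import sys
    import yaml  # PyYAML

    SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"  # assumed path
    WARN, CRIT = 1, 3  # illustrative thresholds for failed resources

    def main():
        with open(SUMMARY) as f:
            summary = yaml.safe_load(f)
        failed = summary.get("resources", {}).get("failed", 0)
        failed += summary.get("events", {}).get("failure", 0)
        if failed >= CRIT:
            print("CRITICAL: %d failed resources/events in last puppet run" % failed)
            return 2
        if failed >= WARN:
            print("WARNING: %d failed resources/events in last puppet run" % failed)
            return 1
        print("OK: no failed resources in last puppet run")
        return 0

    if __name__ == "__main__":
        sys.exit(main())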
[13:21:47] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:22:28] PROBLEM - Host cr1-eqiad IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:861:ffff::1 [13:23:07] RECOVERY - Host cr1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 5.96 ms [13:23:31] icinga lags so much :( [13:25:17] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [13:25:24] kart_: seems like on that integration slave jenkins debian glue 0.17 is not installed [13:25:32] it's the package that provides that file [13:25:42] for some reason a local installation is on other slaves [13:25:46] akosiaris: oh. Thanks! [13:25:54] not sure what has happened here though [13:25:58] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:26:17] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:26:27] RECOVERY - puppet last run on ms-be2022 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [13:27:15] kart_: there you go https://phabricator.wikimedia.org/T141114 [13:27:27] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:27:38] kart_: I think your job was just routed to the wrong backend [13:28:48] RECOVERY - Host cr1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 1.49 ms [13:29:25] akosiaris: cool. [13:29:26] (03PS3) 10KartikMistry: apertium-isl: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-isl] - 10https://gerrit.wikimedia.org/r/296050 (https://phabricator.wikimedia.org/T107306) [13:29:42] (03CR) 10jenkins-bot: [V: 04-1] apertium-isl: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-isl] - 10https://gerrit.wikimedia.org/r/296050 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [13:31:47] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-isl] - 10https://gerrit.wikimedia.org/r/296050 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [13:33:04] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-hbs] - 10https://gerrit.wikimedia.org/r/305498 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [13:35:14] finally some jenkins love. [13:38:57] !log cr1-eqiad: setting "chassis network-services enhanced-ip" and rebooting both REs [13:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:41:57] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] [13:42:08] PROBLEM - Host cr1-eqiad is DOWN: CRITICAL - Network Unreachable (208.80.154.196) [13:42:54] can someone check the MW fatals and see what's up with that? 
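The "N% of data above the critical threshold" alerts scattered through this log (Text/Codfw/Ulsfo/Eqiad HTTP 5xx, MediaWiki exceptions and fatals) come from graphite-backed checks: they fetch the last few minutes of a metric and alarm when too large a fraction of the datapoints exceed a threshold, which is also why they trail the actual spike by several minutes, as noted below. A rough sketch of that logic; the metric name, time window and percentages are illustrative, not the production configuration:

    #!/usr/bin/env python3
    """Sketch of a 'percent of datapoints above threshold' graphite check.

    Metric name, window and thresholds are illustrative only.
    """
    import json
    import urllib.request

    GRAPHITE = "https://graphite.wikimedia.org/render"   # assumed endpoint
    TARGET = "MediaWiki.fatals.rate"                      # hypothetical metric name
    THRESHOLD = 50.0                                      # the [50.0] from the alert text
    WINDOW = "-10min"

    def percent_above(target=TARGET, threshold=THRESHOLD, window=WINDOW):
        url = "%s?target=%s&from=%s&format=json" % (GRAPHITE, target, window)
        with urllib.request.urlopen(url) as resp:
            series = json.loads(resp.read().decode("utf-8"))
        points = [v for v, _ts in series[0]["datapoints"] if v is not None]
        if not points:
            return 0.0
        return 100.0 * sum(1 for v in points if v > threshold) / len(points)

    if __name__ == "__main__":
        pct = percent_above()
        state = "CRITICAL" if pct >= 20.0 else "OK"
        print("%s: %.2f%% of data above the critical threshold [%.1f]" % (state, pct, THRESHOLD))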
[13:43:01] <_joe_> ack [13:43:02] cr1-eqiad is completely depooled for some time now, so it shouldn't be that [13:43:18] PROBLEM - Host cr1-eqiad IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:861:ffff::1 [13:43:33] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-urd-hin] - 10https://gerrit.wikimedia.org/r/296368 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [13:43:49] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0BRpe-5/3/0.32769: down - BR [13:43:49] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0BRpe-5/3/0.32769: down - BR [13:43:56] <_joe_> paravoid: some cirrus failures I'd say [13:43:58] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [13:44:04] <_joe_> not so much that it should worry us [13:44:19] PROBLEM - puppet last run on ms-be2019 is CRITICAL: CRITICAL: puppet fail [13:44:28] _joe_: I'm having a look, cc dcausse [13:44:58] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 206, down: 4, dormant: 0, excluded: 0, unused: 0BRxe-5/2/0: down - Core: cr1-eqiad:xe-5/2/0 {#1983} [10Gbps DF]BRxe-4/3/0: down - Core: cr1-eqiad:xe-4/3/0 {#3456} [10Gbps DF]BRae0: down - Core: cr1-eqiad:ae0BRxe-5/3/0: down - Core: cr1-eqiad:xe-5/3/0 {#2651} [10Gbps DF]BR [13:45:14] <_joe_> paravoid: also, that alarm has a significant lag [13:45:23] yeah :( [13:45:24] <_joe_> the peak was about 10 minutes ago [13:45:32] <_joe_> well it's the nature of the check [13:45:48] <_joe_> it makes sense for long-standing issues, not to detect sudden failures [13:45:55] (03PS3) 10KartikMistry: apertium-eus: Rebuild for Jessie and other fixes [debs/contenttranslation/apertium-eus] - 10https://gerrit.wikimedia.org/r/294673 (https://phabricator.wikimedia.org/T107306) [13:49:00] (03CR) 10jenkins-bot: [V: 04-1] apertium-eus: Rebuild for Jessie and other fixes [debs/contenttranslation/apertium-eus] - 10https://gerrit.wikimedia.org/r/294673 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [13:50:17] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [13:53:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [13:54:14] <_joe_> again another spike of cirrus fatals [13:54:21] most like hhvm pooled http connections and cirrus ^ [13:54:26] (03PS7) 10Ottomata: Refactor zookeeper cluster config so it is available in all hiera scopes [puppet] - 10https://gerrit.wikimedia.org/r/305321 (https://phabricator.wikimedia.org/T143232) [13:54:39] !log cr1-eqiad: shutting down both routing-engines and powering off [13:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:55:36] the search as you type queries suffer from codfw extra latency :/ [13:56:37] it's 30-35ms round-trip [13:57:12] dcausse: same analysis. We *might* want to increase pool size when switching to codfw. Also, seems that it is mainly API requests, similar to last similar issue. 
[13:57:31] those queries are usually in the 8-9ms range with eqiad but ~100ms with codfw :( [13:57:57] !log cr1-eqiad: cmjohnson1 replacing both SCBs with SCBE2s and adding a new linecard [13:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:59:47] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 206, down: 4, dormant: 0, excluded: 0, unused: 0BRxe-5/2/0: down - Core: cr1-eqiad:xe-5/2/0 {#1983} [10Gbps DF]BRxe-4/3/0: down - Core: cr1-eqiad:xe-4/3/0 {#3456} [10Gbps DF]BRae0: down - Core: cr1-eqiad:ae0BRxe-5/3/0: down - Core: cr1-eqiad:xe-5/3/0 {#2651} [10Gbps DF]BR [14:00:19] maybe it does 3x roundtrips per query? [14:00:30] (not caching TCP connections?) [14:00:59] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [14:01:29] bblack: TCP connections should definitely be cached (and SSL negociation). [14:01:36] (by "not caching TCP connections", I don't mean making 3x connections per query, but rather that if it makes a new connection per query, you'll suffer an extra RTT (at least) setting up a new one each time) [14:02:11] if TCP is cached, then the next question is whether a query causes more than one round-trip at the applayer (e.g. 2-3x serial queries) [14:02:26] bblack: the error we are seeing is our pool of cached connections saturating [14:02:38] that's causing ~100ms? [14:02:40] (03PS1) 10Alexandros Kosiaris: check_puppetrun: Improve full fail error message [puppet] - 10https://gerrit.wikimedia.org/r/305504 [14:02:42] (03PS1) 10Alexandros Kosiaris: check_puppetrun: Add failed resource warning/critical levels [puppet] - 10https://gerrit.wikimedia.org/r/305505 [14:03:19] that's a question for dcausse (do we have multiple roundtrip at application level, and where exactly do we measure the query time). [14:03:33] or I guess, it could be that the necessary connection pool size for a given load is going to vary with latency. If every query takes 9ms you need 100 connections, if every query takes 40ms now you need 400 connections to not block? 
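bblack's pool-sizing point is Little's law: the number of connections busy at once is roughly the request rate multiplied by how long each request holds a connection, so going from 9 ms to 40 ms per query quadruples the pool needed to avoid blocking. A back-of-the-envelope sketch; the request rate is an assumed figure, only the latencies come from the discussion above:

    #!/usr/bin/env python3
    """Little's law sketch for sizing the HHVM -> Elasticsearch connection pool.

    The request rate is assumed for illustration; the latencies come from the
    conversation above.
    """
    def pool_size_needed(requests_per_sec, latency_ms):
        # Average in-flight requests = arrival rate * time each one holds a connection.
        return requests_per_sec * (latency_ms / 1000.0)

    rate = 10000.0  # assumed completion queries per second across the cluster
    for latency_ms in (9.0, 40.0, 100.0):
        print("%.0f ms/query at %d req/s -> ~%d connections busy on average"
              % (latency_ms, rate, pool_size_needed(rate, latency_ms)))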
[14:03:54] gehel: yes unfortunately completion queries still uses a 2 pass techniques :( [14:04:26] so maybe both (extra RT per query, and with the added latency per request you need more connections in the pool) [14:04:35] this is something we wanted to fix with elastic 2.x but the feature needed to do it was reverted from the 2.x branch [14:04:48] and is now delayed to elastic 5..x [14:04:55] bblack: yes, that's my assumption (pool size needs to be larger if we go to codfw) [14:05:25] we don't have much metrics on that pool size at the moment, so we only know when we saturate [14:05:53] gehel: I'm pretty sure we have different poolcounter settings for codfw so makes to do the same on hhvm pools [14:06:17] s/makes/it makes sense/ [14:07:10] dcausse: and actually we already have the pool size configured separately for eqiad and codfw, they just happen to have the same size [14:07:19] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [14:09:47] RECOVERY - puppet last run on ms-be2019 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [14:09:53] * gehel is not confident in resizing a resource pool without metrics to back the sizing [14:10:07] (03CR) 10Eevans: "> We have two clusters now, so we probably need both commands," [puppet] - 10https://gerrit.wikimedia.org/r/305367 (https://phabricator.wikimedia.org/T143259) (owner: 10Eevans) [14:10:35] gehel: pool counter is already here to protect elastic [14:11:22] but yes it's extremely hard to adjust properly without any metrics :( [14:11:28] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [14:12:38] RECOVERY - Host cr1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.00 ms [14:12:38] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [14:12:46] !log cr1-eqiad: activate chassis redundancy graceful-switchover [14:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:13:04] curl pools are unfortunately not currently runtime changable. I upstreamed a patch to hhvm that both exposes metrics and allows updating pool sizes at runtime, but we havn't updated to that hhvm version yet. might be worth backporting i suppose [14:13:28] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [14:13:28] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [14:14:19] RECOVERY - Host cr1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 1.61 ms [14:18:03] !log cr1-eqiad: reenabling all asw row A interfaces [14:18:04] ebernhardson: makes sense yes, do we build and deploy our own version of hhvm? 
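The two-pass completion technique mentioned above also explains the raw numbers quoted earlier (8-9 ms against eqiad, ~100 ms against codfw with a 30-35 ms round trip): each pass pays the cross-DC RTT on top of the Elasticsearch-side work. A quick sketch, where the per-pass service time and the same-DC round trip are assumptions used only to show the shape of the arithmetic:

    #!/usr/bin/env python3
    """How cross-DC round trips inflate a cheap completion query.

    Per-pass service time and the local RTT are assumed; the cross-DC RTT and
    the observed totals come from the discussion above.
    """
    cross_dc_rtt_ms = 32.5   # eqiad <-> codfw, 30-35 ms per the log
    local_rtt_ms = 0.5       # assumed same-DC round trip
    service_ms = 4.0         # assumed Elasticsearch work per pass

    for passes in (1, 2, 3):
        local = passes * (local_rtt_ms + service_ms)
        remote = passes * (cross_dc_rtt_ms + service_ms)
        print("%d pass(es): ~%4.1f ms same-DC vs ~%5.1f ms cross-DC" % (passes, local, remote))

With two to three passes this lands squarely in the 9 ms versus ~100 ms range being reported.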
[14:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:18:23] !log increasing elasticsearch recovery throttling to 40mb to speed up eqiad recovery [14:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:18:54] gehel: please don't do any ES maintenance now [14:19:13] in the middle of eqiad network work, I want to know what alerts are going to be about [14:19:44] !log cr1-eqiad: reenabling all asw row B/C interfaces [14:19:47] dcausse: yes and sortof, our own version doesn't have any custom patches, but it might have backports [14:19:48] paravoid: sorry... trying to get it back online. reverting... [14:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:19:54] gehel: no, leave it [14:19:58] just stop making any changes :) [14:20:03] paravoid: ok [14:20:06] (03CR) 10BBlack: varnish: switch from libGeoIP to libmaxminddb (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/253619 (https://phabricator.wikimedia.org/T99226) (owner: 10Faidon Liambotis) [14:20:13] we both are, if things alert know we won't know what's at fault [14:20:32] !log cr1-eqiad: reenabling all asw row D interfaces [14:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:20:41] (03PS2) 10BBlack: Remove geoiplookup service IPs from LVS [puppet] - 10https://gerrit.wikimedia.org/r/305420 (https://phabricator.wikimedia.org/T100902) [14:20:43] (03PS2) 10BBlack: GeoIP VCL: remove JSON output support [puppet] - 10https://gerrit.wikimedia.org/r/305421 (https://phabricator.wikimedia.org/T100902) [14:20:45] (03PS2) 10BBlack: www.toolserver.org: remove geoiplookup reference [puppet] - 10https://gerrit.wikimedia.org/r/305418 (https://phabricator.wikimedia.org/T100902) [14:20:47] (03PS2) 10BBlack: GeoIP VCL: re-set old IPv6 no-data cookies [puppet] - 10https://gerrit.wikimedia.org/r/305419 (https://phabricator.wikimedia.org/T99226) [14:20:50] (03PS9) 10BBlack: varnish: switch from libGeoIP to libmaxminddb [puppet] - 10https://gerrit.wikimedia.org/r/253619 (https://phabricator.wikimedia.org/T99226) (owner: 10Faidon Liambotis) [14:24:08] !log cr1-eqiad: reenabling xe-4/2/0 (link to cr1-codfw) and xe-4/2/2 (link to cr2-knams) [14:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:24:51] !log cr1-eqiad: reenabling xe-5/0/3 (link to pfw) and the Fundraising BGP group [14:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:25:08] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 109, down: 0, dormant: 0, excluded: 2, unused: 0 [14:27:17] !log cr1-eqiad: removing deprioritization of all VRRP groups (priority=50) [14:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:27:23] (03PS3) 10Ottomata: Instance-aware Cassandra restarts for aqs-admins [puppet] - 10https://gerrit.wikimedia.org/r/305367 (https://phabricator.wikimedia.org/T143259) (owner: 10Eevans) [14:29:40] !log cr2-eqiad: setting ae3.1019 inet vrrp priority to default (from 50) [14:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:29:46] lol [14:29:48] that simple [14:30:11] yeah [14:31:06] (03PS3) 10BBlack: Remove geoiplookup service IPs from LVS [puppet] - 10https://gerrit.wikimedia.org/r/305420 (https://phabricator.wikimedia.org/T100902) [14:31:09] (03PS3) 10BBlack: GeoIP VCL: remove JSON output support 
[puppet] - 10https://gerrit.wikimedia.org/r/305421 (https://phabricator.wikimedia.org/T100902) [14:31:10] (03PS3) 10BBlack: GeoIP VCL: re-set old IPv6 no-data cookies [puppet] - 10https://gerrit.wikimedia.org/r/305419 (https://phabricator.wikimedia.org/T99226) [14:31:12] (03PS10) 10BBlack: varnish: switch from libGeoIP to libmaxminddb [puppet] - 10https://gerrit.wikimedia.org/r/253619 (https://phabricator.wikimedia.org/T99226) (owner: 10Faidon Liambotis) [14:31:21] !log cr1-eqiad: reenabling Private-Peer/Transit interfaces [14:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:31:26] !log cr1-eqiad: activating Private-Peer4/6 BGP sessions [14:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:32:17] (03CR) 10Ottomata: [C: 032] Instance-aware Cassandra restarts for aqs-admins [puppet] - 10https://gerrit.wikimedia.org/r/305367 (https://phabricator.wikimedia.org/T143259) (owner: 10Eevans) [14:32:29] !log cr1-eqiad: activating Transit4/6 BGP sessions [14:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:33:59] (03PS8) 10Ottomata: Refactor zookeeper cluster config so it is available in all hiera scopes [puppet] - 10https://gerrit.wikimedia.org/r/305321 (https://phabricator.wikimedia.org/T143232) [14:34:53] !log cr1-eqiad: activating PyBal/LVS BGP sessions [14:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:35:40] ok [14:35:43] do we restore statics? :) [14:35:54] not in their state I suppose [14:36:23] (03CR) 10Ottomata: [C: 032] Refactor zookeeper cluster config so it is available in all hiera scopes [puppet] - 10https://gerrit.wikimedia.org/r/305321 (https://phabricator.wikimedia.org/T143232) (owner: 10Ottomata) [14:36:37] RECOVERY - ElasticSearch health check for shards on elastic1043 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 814, number_of_pending_tasks: 11, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 5413, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1328686561, [14:36:57] RECOVERY - ElasticSearch health check for shards on elastic1021 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 811, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 1609, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1655412764, a [14:36:57] RECOVERY - ElasticSearch health check for shards on elastic1042 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 811, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 1685, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1655412764, a [14:36:57] RECOVERY - ElasticSearch health check for shards on elastic1033 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 811, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 1745, cluster_name: production-search-eqiad, 
relocating_shards: 0, active_shards_percent_as_number: 90.1655412764, a [14:36:57] RECOVERY - ElasticSearch health check for shards on elastic1025 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 811, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 1724, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1655412764, a [14:36:59] RECOVERY - ElasticSearch health check for shards on elastic1039 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 810, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1764321499, acti [14:36:59] RECOVERY - ElasticSearch health check for shards on elastic1031 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 810, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1764321499, acti [14:36:59] RECOVERY - ElasticSearch health check for shards on elastic1035 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 810, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1764321499, acti [14:37:01] why does that recover now? 
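For reference, gehel's earlier "!log increasing elasticsearch recovery throttling to 40mb to speed up eqiad recovery" refers to the stock cluster-wide indices.recovery.max_bytes_per_sec setting, usually applied as a transient cluster setting so it does not outlive a full restart. A hedged sketch of that call; the node name is only an example of any cluster member reachable on port 9200:

    #!/usr/bin/env python3
    """Bump Elasticsearch recovery throttling as a transient cluster setting.

    The host is an example; the setting name is the standard Elasticsearch one.
    """
    import json
    import urllib.request

    ES = "http://elastic1017.eqiad.wmnet:9200"   # any node in the cluster

    body = json.dumps({
        "transient": {"indices.recovery.max_bytes_per_sec": "40mb"}
    }).encode("utf-8")

    req = urllib.request.Request(ES + "/_cluster/settings", data=body,
                                 headers={"Content-Type": "application/json"},
                                 method="PUT")
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode("utf-8"))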
[14:37:17] RECOVERY - ElasticSearch health check for shards on elastic1022 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 810, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1764321499, acti [14:37:17] RECOVERY - ElasticSearch health check for shards on elastic1017 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 810, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1764321499, acti [14:37:18] RECOVERY - ElasticSearch health check for shards on elastic1026 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 810, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1764321499, acti [14:37:18] RECOVERY - ElasticSearch health check for shards on elastic1037 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 810, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1764321499, acti [14:37:18] RECOVERY - ElasticSearch health check for shards on elastic1044 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 810, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1764321499, acti [14:37:18] RECOVERY - ElasticSearch health check for shards on elastic1040 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 810, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1764321499, acti [14:37:18] RECOVERY - ElasticSearch health check for shards on elastic1027 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 810, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1764321499, acti [14:37:19] RECOVERY - ElasticSearch health check for shards on elastic1024 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 810, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, 
active_shards_percent_as_number: 90.1764321499, acti [14:37:19] RECOVERY - ElasticSearch health check for shards on elastic1046 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 810, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1764321499, acti [14:37:20] RECOVERY - ElasticSearch health check for shards on elastic1023 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 810, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1764321499, acti [14:37:38] RECOVERY - ElasticSearch health check for shards on elastic1029 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 809, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1873230233, acti [14:37:38] RECOVERY - ElasticSearch health check for shards on elastic1047 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 809, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1873230233, acti [14:37:38] RECOVERY - ElasticSearch health check for shards on elastic1041 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 809, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1873230233, acti [14:37:38] RECOVERY - ElasticSearch health check for shards on elastic1019 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 809, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1873230233, acti [14:37:48] RECOVERY - ElasticSearch health check for shards on elastic1030 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 804, number_of_pending_tasks: 18, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 9445, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.2417773905, [14:37:48] RECOVERY - ElasticSearch health check for shards on elastic1032 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 803, number_of_pending_tasks: 19, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 9605, 
cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.252668264, a [14:37:49] RECOVERY - ElasticSearch health check for shards on elastic1045 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 803, number_of_pending_tasks: 19, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 9654, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.252668264, a [14:37:49] RECOVERY - ElasticSearch health check for shards on elastic1038 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 803, number_of_pending_tasks: 19, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 9700, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.252668264, a [14:37:49] RECOVERY - ElasticSearch health check for shards on elastic1034 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 803, number_of_pending_tasks: 19, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 9717, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.252668264, a [14:37:49] RECOVERY - ElasticSearch health check for shards on elastic1020 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 803, number_of_pending_tasks: 19, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 9749, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.252668264, a [14:37:49] RECOVERY - ElasticSearch health check for shards on elastic1018 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 803, number_of_pending_tasks: 19, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 9777, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.252668264, a [14:37:50] RECOVERY - ElasticSearch health check for shards on elastic1036 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 803, number_of_pending_tasks: 19, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 9780, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.252668264, a [14:37:50] RECOVERY - ElasticSearch health check for shards on elastic1028 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 803, number_of_pending_tasks: 19, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3037, task_max_waiting_in_queue_millis: 9793, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.252668264, a [14:38:10] (03PS1) 10Jcrespo: Add public logic for grants to m5 db for striker application [puppet] - 10https://gerrit.wikimedia.org/r/305506 (https://phabricator.wikimedia.org/T142545) [14:38:42] !log cr1-eqiad: JunOS, SCB and linecard upgrade is over [14:38:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:58] PROBLEM - puppet last run on aqs1004 is CRITICAL: CRITICAL: puppet fail [14:39:06] so how does that ES recovery now make sense? [14:39:24] unrelated [14:39:27] it just recovered I think [14:39:42] synced with each other that is [14:39:49] i guess [14:39:59] If I can now ask w/o joggling elbows… that was a network card becoming an ex network card, right? [14:40:05] no [14:40:25] that was just network maintenance with buggy router software [14:40:58] RECOVERY - puppet last run on aqs1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:41:04] !log cr1-eqiad and cr2-eqiad: replace obsolete PIM BFD statements with new family inet/inet6 ones [14:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:41:33] Ah. [14:42:25] (03PS6) 10RobH: Enabling all users for request of ISI Foundation team [puppet] - 10https://gerrit.wikimedia.org/r/305398 [14:44:25] gehel, dcausse: eqiad network maint is over and apparently ES in eqiad recovered, you may want to revert that codfw switchover commit? [14:45:13] paravoid: there is still >600 unassigned shards [14:45:16] paravoid: sure, cluster is not yet green but we should revert shortly [14:45:29] BTW, whenever someone has a moment I still have an issue from the other day. [14:45:32] oh ok, I saw all those RECOVERYs above [14:45:51] can you guys fix our alerting a little bit? [14:46:00] it's not super urgent but it'd be super nice :) [14:46:32] specifically, for cluster errors, not individual nodes, it'd be better if we were getting one alert [14:46:40] (and that one should probably be made a paging alert?) [14:46:47] paravoid: agreed [14:46:51] does that make sense? shall I open a task for it? [14:46:52] paravoid: yes, agreed [14:47:11] paravoid: I think I have a task for it, lemme check [14:47:29] T133844 [14:47:31] T133844: Improve Elasticsearch icinga alerting - https://phabricator.wikimedia.org/T133844 [14:49:04] (03PS1) 10Gehel: Revert "Switching search traffic to codfw as eqiad seems unstable." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305508 [14:49:29] let's raise its priority then :) [14:50:19] paravoid: ready for the revert... [14:51:29] (03PS1) 10Aude: Bump cache epoch for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305509 (https://phabricator.wikimedia.org/T143249) [14:51:30] dcausse: do you want to check anything more on that eqiad cluster before I send traffic back to it? [14:52:04] gehel: you don't want to wait for green? [14:52:27] (03PS1) 10Ottomata: Enable MirrorMaker from main-eqiad to main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/305510 (https://phabricator.wikimedia.org/T134184) [14:53:18] (03PS2) 10Jhobs: Deploy lazy loaded images to mobile web [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305387 (https://phabricator.wikimedia.org/T142399) [14:53:24] dcausse: additional latency vs additional load on the cluster... not sure one is worse than the other [14:54:11] !Log ms-be1005 replacing disk Slot Number: 3 [14:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:54:15] gehel: I'd wait for 300 unassigned shards it's equivalent roughly to one node restart [14:54:31] dcausse: make sense, so waiting a bit more... 
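On the alerting improvement agreed above (T133844, one cluster-level alert instead of a page full of per-node CRITICALs): the usual shape of such a check is to query _cluster/health once per cluster and map the status and unassigned-shard count onto a single Nagios state. A minimal sketch; the endpoint and the red/yellow mapping are illustrative choices, not the check that was eventually deployed:

    #!/usr/bin/env python3
    """Sketch: one icinga check per Elasticsearch cluster via _cluster/health.

    Endpoint and the 'red is critical, yellow is warning' mapping are
    illustrative, not the deployed configuration.
    """
    import json
    import sys
    import urllib.request

    URL = "http://search.svc.eqiad.wmnet:9200/_cluster/health"  # assumed endpoint

    def main():
        with urllib.request.urlopen(URL) as resp:
            health = json.loads(resp.read().decode("utf-8"))
        msg = "status: %(status)s, unassigned_shards: %(unassigned_shards)s" % health
        if health["status"] == "red":
            print("CRITICAL - " + msg)
            return 2
        if health["status"] == "yellow":
            print("WARNING - " + msg)
            return 1
        print("OK - " + msg)
        return 0

    if __name__ == "__main__":
        sys.exit(main())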
[14:54:58] (03CR) 10Ottomata: [C: 032] Enable MirrorMaker from main-eqiad to main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/305510 (https://phabricator.wikimedia.org/T134184) (owner: 10Ottomata) [14:54:58] RECOVERY - MegaRAID on ms-be1005 is OK: OK: optimal, 13 logical, 13 physical [14:57:08] (03PS1) 10Jcrespo: Fake passwords to mimic in labs the striker-database ones [labs/private] - 10https://gerrit.wikimedia.org/r/305512 (https://phabricator.wikimedia.org/T142545) [14:58:31] (03CR) 10Jcrespo: [C: 032 V: 032] Fake passwords to mimic in labs the striker-database ones [labs/private] - 10https://gerrit.wikimedia.org/r/305512 (https://phabricator.wikimedia.org/T142545) (owner: 10Jcrespo) [14:59:55] suppose i could do swat today [15:00:04] anomie, ostriches, thcipriani, hashar, twentyafterfour, and aude: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160818T1500). [15:00:04] jhobs and aude: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:18] if it's okay to proceed [15:00:48] paravoid: mark is it okay to do swat or do we need to wait? [15:01:08] I'm here [15:02:14] aude: you can go ahead [15:02:18] thanks for checking :) [15:02:24] paravoid: ok [15:03:56] (03CR) 10Jcrespo: [C: 031] "Looks good to me: https://puppet-compiler.wmflabs.org/3760/db1009.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/305506 (https://phabricator.wikimedia.org/T142545) (owner: 10Jcrespo) [15:04:16] (03PS3) 10Jhobs: Deploy lazy loaded images to mobile web [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305387 (https://phabricator.wikimedia.org/T142399) [15:05:00] (03CR) 10Aude: [C: 032] Deploy lazy loaded images to mobile web [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305387 (https://phabricator.wikimedia.org/T142399) (owner: 10Jhobs) [15:05:29] (03Merged) 10jenkins-bot: Deploy lazy loaded images to mobile web [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305387 (https://phabricator.wikimedia.org/T142399) (owner: 10Jhobs) [15:06:49] jhobs: can you check on mw1099? [15:07:49] aude: just to double check, the header I should use is X-Wikimedia-Debug: backend=mw1099.eqiad.wmnet, right? [15:08:20] yes [15:08:31] it looks okay to me (or at least not broken) [15:09:17] seems to work [15:11:01] aude: LGTM! [15:11:09] ok [15:12:37] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Enable lazy loaded images on mobile web (duration: 01m 00s) [15:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:47] ^ [15:13:46] (03CR) 10Aude: [C: 032] Bump cache epoch for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305509 (https://phabricator.wikimedia.org/T143249) (owner: 10Aude) [15:14:59] aude: thanks, looks good! 
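The mw1099 spot-check above works because the X-Wikimedia-Debug header makes the caches route the request to a named debug backend, so a freshly synced config change can be verified on one appserver before it matters anywhere else. A minimal sketch of the same check from a script; the URL is an arbitrary example page, and the header value is the one quoted in the conversation:

    #!/usr/bin/env python3
    """Hit a single debug appserver using the X-Wikimedia-Debug header.

    The target URL is an arbitrary example; which response headers identify
    the serving backend is deployment-specific.
    """
    import urllib.request

    req = urllib.request.Request(
        "https://en.m.wikipedia.org/wiki/Main_Page",
        headers={"X-Wikimedia-Debug": "backend=mw1099.eqiad.wmnet"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status, len(resp.read()), "bytes from the debug backend")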
[15:15:03] :) [15:17:32] (03PS2) 10Aude: Bump cache epoch for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305509 (https://phabricator.wikimedia.org/T143249) [15:17:39] (03CR) 10Aude: Bump cache epoch for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305509 (https://phabricator.wikimedia.org/T143249) (owner: 10Aude) [15:17:43] (03CR) 10Aude: [C: 032] Bump cache epoch for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305509 (https://phabricator.wikimedia.org/T143249) (owner: 10Aude) [15:18:12] (03Merged) 10jenkins-bot: Bump cache epoch for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305509 (https://phabricator.wikimedia.org/T143249) (owner: 10Aude) [15:19:24] (03PS1) 10Ema: varnishlog4: allow methods to be used as callbacks [puppet] - 10https://gerrit.wikimedia.org/r/305517 (https://phabricator.wikimedia.org/T131353) [15:21:42] !log aude@tin Synchronized wmf-config/Wikibase.php: Bump cache epoch for Wikidata (duration: 00m 49s) [15:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:57] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:31:27] (03PS1) 10Gehel: Elasticsearch - check shards via the service, not via each individual node [puppet] - 10https://gerrit.wikimedia.org/r/305519 (https://phabricator.wikimedia.org/T133844) [15:40:43] (03PS3) 10Gehel: Maps - increase Posgresql max_wal_sender [puppet] - 10https://gerrit.wikimedia.org/r/305471 [15:42:22] (03CR) 10Gehel: [C: 032] Maps - increase Posgresql max_wal_sender [puppet] - 10https://gerrit.wikimedia.org/r/305471 (owner: 10Gehel) [15:48:57] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 211, down: 0, dormant: 0, excluded: 0, unused: 0 [15:49:12] (03CR) 10BryanDavis: [C: 031] "IP addrs and grants look right to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/305506 (https://phabricator.wikimedia.org/T142545) (owner: 10Jcrespo) [15:50:56] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 700 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4751396 keys - replication_delay is 700 [15:51:15] (03PS4) 10BBlack: Remove geoiplookup service IPs from LVS [puppet] - 10https://gerrit.wikimedia.org/r/305420 (https://phabricator.wikimedia.org/T100902) [15:51:17] (03PS4) 10BBlack: GeoIP VCL: remove JSON output support [puppet] - 10https://gerrit.wikimedia.org/r/305421 (https://phabricator.wikimedia.org/T100902) [15:51:19] (03PS3) 10BBlack: www.toolserver.org: remove geoiplookup reference [puppet] - 10https://gerrit.wikimedia.org/r/305418 (https://phabricator.wikimedia.org/T100902) [15:51:21] (03PS4) 10BBlack: GeoIP VCL: re-set old IPv6 no-data cookies [puppet] - 10https://gerrit.wikimedia.org/r/305419 (https://phabricator.wikimedia.org/T99226) [15:51:23] (03PS11) 10BBlack: varnish: switch from libGeoIP to libmaxminddb [puppet] - 10https://gerrit.wikimedia.org/r/253619 (https://phabricator.wikimedia.org/T99226) (owner: 10Faidon Liambotis) [15:51:53] !log reset postgresql replication on maps1001 [15:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:52:42] !log reset postgresql replication on maps1003 (correction, not maps1001) [15:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:06] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4716704 keys - replication_delay is 0 [15:53:16] (03CR) 10BBlack: "Recap of recent PS:" [puppet] - 10https://gerrit.wikimedia.org/r/253619 (https://phabricator.wikimedia.org/T99226) (owner: 10Faidon Liambotis) [15:54:31] (03PS5) 10BBlack: Remove geoiplookup service IPs from LVS [puppet] - 10https://gerrit.wikimedia.org/r/305420 (https://phabricator.wikimedia.org/T100902) [15:54:33] (03PS5) 10BBlack: GeoIP VCL: remove JSON output support [puppet] - 10https://gerrit.wikimedia.org/r/305421 (https://phabricator.wikimedia.org/T100902) [15:54:35] (03PS5) 10BBlack: GeoIP VCL: re-set old IPv6 no-data cookies [puppet] - 10https://gerrit.wikimedia.org/r/305419 (https://phabricator.wikimedia.org/T99226) [15:54:37] (03PS12) 10BBlack: varnish: switch from libGeoIP to libmaxminddb [puppet] - 10https://gerrit.wikimedia.org/r/253619 (https://phabricator.wikimedia.org/T99226) (owner: 10Faidon Liambotis) [15:59:55] (03PS1) 10Ema: Set varnish::common::varnish4_python_suffix [puppet] - 10https://gerrit.wikimedia.org/r/305522 [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160818T1600). [16:00:04] tgr and Krenair: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:02:30] I've got https://gerrit.wikimedia.org/r/#/c/301404/ too if puppet swat is actually going to happen today. It was in the last batch that never got deployed [16:02:31] (03PS2) 10Ema: Set varnish::common::varnish4_python_suffix [puppet] - 10https://gerrit.wikimedia.org/r/305522 [16:06:36] yeah I can puppet SWAT [16:06:53] bd808: mind putting that up as well on the deployment calendar? [16:07:02] will do [16:07:06] thanks! [16:07:46] tgr: here? 
I'm going to start with https://gerrit.wikimedia.org/r/#/c/302650 [16:07:57] godog: o/ [16:08:25] the patch is not testable but trivial [16:08:27] (03PS3) 10Filippo Giunchedi: Increase retries for rename jobs [puppet] - 10https://gerrit.wikimedia.org/r/302650 (https://phabricator.wikimedia.org/T141731) (owner: 10Gergő Tisza) [16:08:57] (03PS3) 10Rush: sge collector: set correct env [puppet] - 10https://gerrit.wikimedia.org/r/305401 [16:09:31] godog: added -- https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=817707&oldid=817700 [16:09:37] yup, waiting for jenkins to do its thing tgr [16:09:52] (03CR) 10Filippo Giunchedi: [C: 032] Increase retries for rename jobs [puppet] - 10https://gerrit.wikimedia.org/r/302650 (https://phabricator.wikimedia.org/T141731) (owner: 10Gergő Tisza) [16:09:58] (03CR) 10Ema: [C: 032] Set varnish::common::varnish4_python_suffix [puppet] - 10https://gerrit.wikimedia.org/r/305522 (owner: 10Ema) [16:10:06] (03PS3) 10Ema: Set varnish::common::varnish4_python_suffix [puppet] - 10https://gerrit.wikimedia.org/r/305522 [16:10:10] (03CR) 10Ema: [V: 032] Set varnish::common::varnish4_python_suffix [puppet] - 10https://gerrit.wikimedia.org/r/305522 (owner: 10Ema) [16:12:30] (03PS1) 10Ottomata: Mirror codfw* topics from Kafka main-codfw to main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/305523 (https://phabricator.wikimedia.org/T134184) [16:12:47] bd808: nice, thanks [16:13:03] tgr: {{done}} it'll deploy in the next 30min [16:13:31] manually ran puppet on mw1162 to test, runs/restarts fine [16:14:26] godog: thanks! [16:15:06] np, Krenair here? [16:15:07] (03PS2) 10Ottomata: Mirror codfw* topics from Kafka main-codfw to main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/305523 (https://phabricator.wikimedia.org/T134184) [16:15:11] (03CR) 10Ottomata: [C: 032 V: 032] Mirror codfw* topics from Kafka main-codfw to main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/305523 (https://phabricator.wikimedia.org/T134184) (owner: 10Ottomata) [16:15:14] hi godog [16:16:15] (03PS2) 10Ema: varnishlog4: allow methods to be used as callbacks [puppet] - 10https://gerrit.wikimedia.org/r/305517 (https://phabricator.wikimedia.org/T131353) [16:16:17] (03PS1) 10Ema: Port varnishprocessor to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/305525 (https://phabricator.wikimedia.org/T131353) [16:16:18] !log restart nodepool with STATSD_HOST env variable for test [16:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:17:35] (03CR) 10Rush: [C: 032] sge collector: set correct env [puppet] - 10https://gerrit.wikimedia.org/r/305401 (owner: 10Rush) [16:17:42] (03PS4) 10Rush: sge collector: set correct env [puppet] - 10https://gerrit.wikimedia.org/r/305401 [16:17:53] (03CR) 10Rush: [V: 032] sge collector: set correct env [puppet] - 10https://gerrit.wikimedia.org/r/305401 (owner: 10Rush) [16:18:03] godog, I'm looking at my first patch again... I think the novaconfig network_public_ip comes from common.yaml [16:18:44] Krenair: heh I was gonna say, did you run the puppet compiler on those already? [16:19:20] think I was trying to get and.rew to look at the patch before he went away [16:19:43] don't remember whether I put it through puppet compiler [16:19:52] will do now [16:20:47] ok, thanks, ditto for the beta bits removal ones [16:22:14] godog, how do you propose I put the beta one through puppet compiler..? 
[16:24:02] (03PS3) 10Ema: varnishlog4: allow methods to be used as callbacks [puppet] - 10https://gerrit.wikimedia.org/r/305517 (https://phabricator.wikimedia.org/T131353) [16:24:04] (03PS2) 10Ema: Port varnishprocessor to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/305525 (https://phabricator.wikimedia.org/T131353) [16:24:06] (03PS1) 10Ema: Port varnishmedia to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/305527 (https://phabricator.wikimedia.org/T131353) [16:24:07] Krenair: mh I can't tell right away from the gerrit ui, does https://gerrit.wikimedia.org/r/#/c/303121 already depend on https://gerrit.wikimedia.org/r/#/c/303123/ ? if so running the compiler on the former will dtrt [16:24:50] yes [16:25:07] which hosts should I test the apache module change on? [16:25:16] mediawikis? [16:26:25] (03CR) 10Alex Monk: "http://puppet-compiler.wmflabs.org/3763/" [puppet] - 10https://gerrit.wikimedia.org/r/302835 (owner: 10Alex Monk) [16:27:29] Krenair: yeah that should do it [16:28:54] 06Operations, 10Pybal, 10Traffic: Unhandled pybal ValueError: need more than 1 value to unpack - https://phabricator.wikimedia.org/T143078#2564802 (10ema) p:05Triage>03Normal [16:29:15] (03PS1) 10Rush: nodepool: use integrated statsd reporting [puppet] - 10https://gerrit.wikimedia.org/r/305529 [16:29:45] Krenair: I agree though would be better to get and.rew to sign off on the first [16:31:07] (03PS1) 10Ottomata: Mirror all topics in main-eqiad topics into analytics-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/305530 (https://phabricator.wikimedia.org/T134184) [16:31:33] godog, okay [16:31:46] let's remove that one from the list then [16:32:13] godog, http://puppet-compiler.wmflabs.org/3764/ [16:32:41] (03PS2) 10Ottomata: Mirror all topics in main-eqiad topics into analytics-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/305530 (https://phabricator.wikimedia.org/T134184) [16:32:48] nice, thanks Krenair [16:32:56] (03PS2) 10Filippo Giunchedi: apache conf: Allow source and content to be undefined if ensure is absent [puppet] - 10https://gerrit.wikimedia.org/r/303123 (owner: 10Alex Monk) [16:33:30] (03PS3) 10Ottomata: Mirror all topics in main-eqiad topics into analytics-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/305530 (https://phabricator.wikimedia.org/T134184) [16:34:47] (03CR) 10Thcipriani: [C: 031] nodepool: use integrated statsd reporting [puppet] - 10https://gerrit.wikimedia.org/r/305529 (owner: 10Rush) [16:35:02] (03CR) 10Filippo Giunchedi: [C: 032] apache conf: Allow source and content to be undefined if ensure is absent [puppet] - 10https://gerrit.wikimedia.org/r/303123 (owner: 10Alex Monk) [16:35:46] 06Operations: post build failures for operations/puppet on operations-puppet-doc - https://phabricator.wikimedia.org/T143233#2561502 (10bd808) It looks like the rdoc parser hates something about the modules/puppetdbquery/bin/find-nodes file introduced in rOPUPb1492be. That whole module is marked as being exclude... 
[16:37:47] (03PS4) 10Ottomata: Mirror all topics in main-eqiad topics into analytics-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/305530 (https://phabricator.wikimedia.org/T134184) [16:38:07] (03CR) 10Rush: [C: 032] nodepool: use integrated statsd reporting [puppet] - 10https://gerrit.wikimedia.org/r/305529 (owner: 10Rush) [16:38:09] (03PS3) 10Filippo Giunchedi: Remove bits.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/303121 (https://phabricator.wikimedia.org/T107430) (owner: 10Alex Monk) [16:38:12] (03PS2) 10Rush: nodepool: use integrated statsd reporting [puppet] - 10https://gerrit.wikimedia.org/r/305529 [16:38:15] (03CR) 10Rush: [V: 032] nodepool: use integrated statsd reporting [puppet] - 10https://gerrit.wikimedia.org/r/305529 (owner: 10Rush) [16:38:23] (03PS5) 10Ottomata: Mirror all topics in main-eqiad topics into analytics-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/305530 (https://phabricator.wikimedia.org/T134184) [16:39:54] (03CR) 10Ottomata: [C: 032 V: 032] Mirror all topics in main-eqiad topics into analytics-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/305530 (https://phabricator.wikimedia.org/T134184) (owner: 10Ottomata) [16:39:57] (03CR) 10Filippo Giunchedi: [C: 032] Remove bits.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/303121 (https://phabricator.wikimedia.org/T107430) (owner: 10Alex Monk) [16:40:12] (03PS4) 10Filippo Giunchedi: Remove bits.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/303121 (https://phabricator.wikimedia.org/T107430) (owner: 10Alex Monk) [16:40:15] (03CR) 10Filippo Giunchedi: [V: 032] Remove bits.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/303121 (https://phabricator.wikimedia.org/T107430) (owner: 10Alex Monk) [16:40:53] Krenair: {{done}} let me know if it worked [16:43:11] 06Operations: post build failures for operations/puppet on operations-puppet-doc - https://phabricator.wikimedia.org/T143233#2564858 (10bd808) FWIW, I don't get that failure when I run `bundle exec rake doc` locally: ``` $ bundle exec rake doc Running puppet doc --mode rdoc --all --manifestdir manifests --modul... [16:43:35] 06Operations, 10Traffic, 13Patch-For-Review, 05WMF-deploy-2016-08-09_(1.28.0-wmf.14): Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2564863 (10BBlack) [16:43:37] 06Operations, 10Traffic, 06Wikipedia-iOS-App-Backlog: Wikipedia app hits loads.php on bits.wikimedia.org - https://phabricator.wikimedia.org/T132969#2564860 (10BBlack) 05Open>03Resolved a:03BBlack With no movement for a couple of weeks here and the various above comments (only outdated app versions, ok... 
[16:44:21] (03CR) 10Filippo Giunchedi: [C: 032] mediawiki::php: /etc/php5/apache2 provided by php5-dbg [puppet] - 10https://gerrit.wikimedia.org/r/301404 (owner: 10BryanDavis) [16:44:28] (03PS3) 10Filippo Giunchedi: mediawiki::php: /etc/php5/apache2 provided by php5-dbg [puppet] - 10https://gerrit.wikimedia.org/r/301404 (owner: 10BryanDavis) [16:46:17] godog, yeah looks fine [16:47:09] Krenair: nice [16:47:55] godog, we could also do https://gerrit.wikimedia.org/r/#/c/303122/ at some point [16:48:31] but I won't be around next week, so [16:48:44] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10netops: Open hole in analytics vlan firewall to allow MirrorMaker to talk to main Kafka clusters - https://phabricator.wikimedia.org/T143335#2564917 (10Ottomata) [16:48:46] (03PS1) 10BBlack: Remove bits.wikimedia.org from DNS [dns] - 10https://gerrit.wikimedia.org/r/305533 (https://phabricator.wikimedia.org/T107430) [16:49:55] 06Operations, 06Services, 07Service-deployment-requests, 15User-mobrovac: Investigate better protection modes for electron render service - https://phabricator.wikimedia.org/T143336#2564937 (10GWicke) [16:49:55] Krenair: if your previous one is already applied in beta let's do it now, it is already cruft at this point [16:50:13] 06Operations, 06Services, 07Service-deployment-requests, 15User-mobrovac: Investigate better protection modes for electron render service (xvfb setuid) - https://phabricator.wikimedia.org/T143336#2564952 (10GWicke) [16:50:33] (03CR) 10Jforrester: "Might be worth doing this on top of I3d61178ce ?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304991 (https://phabricator.wikimedia.org/T113877) (owner: 10Gilles) [16:51:12] godog, yep it's gone, let's do it [16:51:56] (03PS3) 10Filippo Giunchedi: beta apaches: Delete Apache::Site['wmflabs'] too [puppet] - 10https://gerrit.wikimedia.org/r/303122 (https://phabricator.wikimedia.org/T107430) (owner: 10Alex Monk) [16:52:17] bd808: yours is merged too btw [16:52:30] cool [16:53:18] (03CR) 10Filippo Giunchedi: [C: 032] beta apaches: Delete Apache::Site['wmflabs'] too [puppet] - 10https://gerrit.wikimedia.org/r/303122 (https://phabricator.wikimedia.org/T107430) (owner: 10Alex Monk) [16:53:43] 06Operations, 06Services, 15User-mobrovac: Investigate better protection modes for electron render service (xvfb setuid) - https://phabricator.wikimedia.org/T143336#2564981 (10mobrovac) [16:53:46] Krenair: ok all done! [16:53:52] great, thanks [16:54:01] godog: if all the MW servers start falling on their face during the puppet runs you can blame my patch (/me expects that not to happen) [16:54:22] 06Operations, 06Services, 15User-mobrovac: Investigate better protection modes for electron render service (xvfb setuid) - https://phabricator.wikimedia.org/T143336#2564937 (10mobrovac) 05Open>03stalled a:05mobrovac>03None Setting as stalled, since there is no real action on our side for the moment. 
[16:54:24] 06Operations, 06Services, 13Patch-For-Review, 07Service-deployment-requests, 15User-mobrovac: New service request - PDF Render - https://phabricator.wikimedia.org/T143129#2564989 (10mobrovac) [16:54:45] bd808: heheh no worries, I've spot-checked a couple of trusty appservers after merging [16:55:45] 06Operations, 06Services, 13Patch-For-Review, 07Service-deployment-requests, 15User-mobrovac: New service request - PDF Render - https://phabricator.wikimedia.org/T143129#2557819 (10mobrovac) [16:55:47] 06Operations, 10Security-Reviews, 06Services, 06Services-next, 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2564998 (10mobrovac) [16:55:49] 06Operations, 06Services, 15User-mobrovac: Investigate better protection modes for electron render service (xvfb setuid) - https://phabricator.wikimedia.org/T143336#2564996 (10mobrovac) [16:56:37] (03PS1) 10BBlack: text VCL: remove bits.wm.o stuff [puppet] - 10https://gerrit.wikimedia.org/r/305535 (https://phabricator.wikimedia.org/T107430) [16:56:43] !log mathoid deploying 75606c71 [16:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:58:40] (03PS1) 10BBlack: MW apache: remove bits.wm.o vhost [puppet] - 10https://gerrit.wikimedia.org/r/305536 (https://phabricator.wikimedia.org/T107430) [16:59:02] (03PS1) 10Rush: nodepool: reduce ready count for jessie and set higher rate [puppet] - 10https://gerrit.wikimedia.org/r/305537 (https://phabricator.wikimedia.org/T143016) [16:59:20] (03PS2) 10Rush: nodepool: reduce ready count for jessie and set higher rate [puppet] - 10https://gerrit.wikimedia.org/r/305537 (https://phabricator.wikimedia.org/T143016) [16:59:48] 06Operations, 10ops-eqiad: Rack/setup sodium (carbon/mirror server replacement) - https://phabricator.wikimedia.org/T139171#2565012 (10Cmjohnson) Spoke with Dell support technician Robert Thaler today. We went over some things that were already one and he's also stumped by the issue. He did state that there... [17:00:05] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160818T1700). [17:00:15] no parsoid deploy today [17:00:21] Hey jouncebot. nothing today :) [17:00:24] (for ORES) [17:00:49] we should have jouncebot reply something back, just like morebots does, just for the fun of it [17:01:44] they should talk to each other! [17:02:21] haha [17:03:46] (03CR) 10Thcipriani: [C: 031] nodepool: reduce ready count for jessie and set higher rate [puppet] - 10https://gerrit.wikimedia.org/r/305537 (https://phabricator.wikimedia.org/T143016) (owner: 10Rush) [17:03:50] (03CR) 10BryanDavis: Provision Striker via scap3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/301505 (https://phabricator.wikimedia.org/T141014) (owner: 10BryanDavis) [17:04:09] which reminds me I still haven't pushed through converting some to NOTICE instead of PRIVMSG [17:04:24] 06Operations, 10ops-eqiad: Rack/setup sodium (carbon/mirror server replacement) - https://phabricator.wikimedia.org/T139171#2565026 (10faidon) The RAID controller is actually useful and expensive. They sold us this system, in this configuration, with a RAID controller (w/ a BBU) and those specific disks. Can y... 
[17:05:59] 06Operations, 10ops-eqiad: Rack/setup sodium (carbon/mirror server replacement) - https://phabricator.wikimedia.org/T139171#2565029 (10Cmjohnson) The dell tech did look and told me there are non 4k 6TB disks we could use. http://accessories.ap.dell.com/sna/productdetail.aspx?c=au&l=en&s=dhs&cs=audhs1&sku=400-A... [17:08:17] 06Operations, 07Puppet, 10ORES, 06Revision-Scoring-As-A-Service: Clean up puppet & configs for ORES - https://phabricator.wikimedia.org/T142002#2565031 (10Halfak) Hi @akosiaris. It seems that you've commented on the naming scheme. I agree that we currently have "99-main.yaml" in production, but I think w... [17:08:25] (03CR) 10Rush: [C: 032] nodepool: reduce ready count for jessie and set higher rate [puppet] - 10https://gerrit.wikimedia.org/r/305537 (https://phabricator.wikimedia.org/T143016) (owner: 10Rush) [17:10:52] 06Operations, 10Traffic, 13Patch-For-Review, 05WMF-deploy-2016-08-09_(1.28.0-wmf.14): Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2565036 (10BBlack) I'd like to start the decom here with the DNS removal of the `bits.wikimedia.org` hostname itself, so that the traffic dies... [17:12:24] !log nodepool restart with new settings for rate/env/ready for jessie from puppet [17:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:15:23] !log reinstall ms-be1027 after ssd replaced T140374 [17:15:24] T140374: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374 [17:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:16:10] PROBLEM - puppet last run on mw2100 is CRITICAL: CRITICAL: puppet fail [17:17:02] 06Operations, 10hardware-requests: Site: 8 hardware access request for ORES - https://phabricator.wikimedia.org/T142578#2539946 (10RobH) We don't have any single CPU hosts (they tend to be 10 cores these days at 2.4GHz). The specs for this are very similar to our recent allocations/purchases for puppetmaster... [17:18:17] elasticsearch eqiad is mostly recovered, I'm switching traffic back [17:19:37] !log elasticsearch eqiad recovery throttling back to standard 20mb recovery [17:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:19:48] (03CR) 10Gehel: [C: 032] Revert "Switching search traffic to codfw as eqiad seems unstable." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305508 (owner: 10Gehel) [17:19:56] (03PS2) 10Gehel: Revert "Switching search traffic to codfw as eqiad seems unstable." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305508 [17:20:24] 06Operations, 10ops-eqiad: Rack/setup sodium (carbon/mirror server replacement) - https://phabricator.wikimedia.org/T139171#2565098 (10RobH) We'll need our Dell reps looped in, as they did sell us this config and it should work. In addition, we had to buy cables, since one broke diagnosing an issue that they... 
[17:22:45] (03PS4) 10Yuvipanda: dynamicproxy: Add nginx config to redirect www.wmflabs.org/wmflabs.org to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/303938 (https://phabricator.wikimedia.org/T38885) (owner: 10Alex Monk) [17:24:41] !log gehel@tin Synchronized wmf-config/InitialiseSettings.php: switching search back to eqiad (duration: 00m 49s) [17:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:25:27] (03CR) 10Yuvipanda: [C: 032 V: 032] dynamicproxy: Add nginx config to redirect www.wmflabs.org/wmflabs.org to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/303938 (https://phabricator.wikimedia.org/T38885) (owner: 10Alex Monk) [17:27:19] 06Operations: post build failures for operations/puppet on operations-puppet-doc - https://phabricator.wikimedia.org/T143233#2565114 (10RobH) It appears that was a change from @joe, so looping him onto this task. [17:28:45] 06Operations, 10hardware-requests: codfw/eqiad:(4+4) hardware access request for ORES - https://phabricator.wikimedia.org/T142578#2565118 (10RobH) [17:31:29] RECOVERY - HP RAID on ms-be1027 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:31:30] RECOVERY - MD RAID on ms-be1027 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 [17:36:16] (03PS1) 10Yuvipanda: dynamicproxy: Don't set dynamic proxy to be defaut server [puppet] - 10https://gerrit.wikimedia.org/r/305540 [17:36:29] 06Operations, 10Traffic, 13Patch-For-Review, 05WMF-deploy-2016-08-09_(1.28.0-wmf.14): Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2565144 (10BBlack) (edited above to note it's just favicon, not others, that's these bulk). Also notable, many of these self-referred favicon... [17:36:53] (03PS2) 10Yuvipanda: dynamicproxy: Don't set dynamic proxy to be defaut server [puppet] - 10https://gerrit.wikimedia.org/r/305540 [17:36:59] (03CR) 10Yuvipanda: [C: 032 V: 032] dynamicproxy: Don't set dynamic proxy to be defaut server [puppet] - 10https://gerrit.wikimedia.org/r/305540 (owner: 10Yuvipanda) [17:39:51] RECOVERY - puppet last run on ms-be1027 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [17:44:20] 06Operations: post build failures for operations/puppet on operations-puppet-doc - https://phabricator.wikimedia.org/T143233#2565154 (10bd808) This upstream bug seems related -- [[https://tickets.puppetlabs.com/browse/PUP-3261|puppet doc passes files to rdoc too agressively]]. Sadly open for 2 years with no acti... [17:44:39] RECOVERY - puppet last run on mw2100 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:49:29] 06Operations, 10Mail: Emails dropping from Greenhouse to Alan - https://phabricator.wikimedia.org/T142427#2565187 (10bbogaert) 05Open>03Resolved a:03bbogaert Hi Alex, Thanks double checking this! We double-triple-checked the spelling of his name. I think it might be a greenhouse issue. We'll pass the "p... [17:52:34] Krinkle: any parting thoughts on https://phabricator.wikimedia.org/T107430#2565036 ? I'm kind of torn between waiting till next week for any followup commentary, and just Being Bold and removing it today while I'm around to watch for complaints (and expecting none, probably). 
[17:54:41] PROBLEM - HP RAID on ms-be1022 is CRITICAL: Connection refused by host [17:54:50] PROBLEM - swift-account-server on ms-be1022 is CRITICAL: Connection refused by host [17:55:01] PROBLEM - swift-container-replicator on ms-be1022 is CRITICAL: Connection refused by host [17:55:04] that's me, silencing [17:55:10] PROBLEM - Check size of conntrack table on ms-be1022 is CRITICAL: Connection refused by host [17:55:10] PROBLEM - swift-object-auditor on ms-be1022 is CRITICAL: Connection refused by host [17:55:21] PROBLEM - swift-container-server on ms-be1022 is CRITICAL: Connection refused by host [17:55:21] PROBLEM - NTP on ms-be1022 is CRITICAL: NTP CRITICAL: No response from NTP server [17:55:29] PROBLEM - configured eth on ms-be1022 is CRITICAL: Connection refused by host [17:58:04] (03PS1) 10Yuvipanda: labs: Don't set any default_server for novaproxy [puppet] - 10https://gerrit.wikimedia.org/r/305546 [17:59:03] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Don't set any default_server for novaproxy [puppet] - 10https://gerrit.wikimedia.org/r/305546 (owner: 10Yuvipanda) [17:59:27] (03CR) 1020after4: [C: 031] phabricator: allow ssh between servers for cluster support [puppet] - 10https://gerrit.wikimedia.org/r/305277 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [18:05:40] 06Operations, 10ops-eqiad: ms-be1005 - MegaRAID - CRITICAL: 1 failed LD(s) (Offline) - https://phabricator.wikimedia.org/T143265#2565277 (10Dzahn) cool! and so fast! thank you Icinga says that MegaRAID says "OK: optimal, 14 logical, 14 physical " again :) [18:13:48] 10Blocked-on-Operations, 06Operations, 10Cassandra, 06Services: Remove obsolete metrics - https://phabricator.wikimedia.org/T139792#2442835 (10RobH) Since this can now be filtered, would we want to remove that historical data at all anymore? (Just asking since this is under blocked on ops and I'm triaging... [18:14:29] PROBLEM - swift-account-replicator on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [18:15:56] ^ me again, silenced now [18:18:38] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10netops: Open hole in analytics vlan firewall to allow MirrorMaker to talk to main Kafka clusters - https://phabricator.wikimedia.org/T143335#2565307 (10faidon) 05Open>03Resolved a:03faidon Should be done! [18:23:25] 10Blocked-on-Operations, 06Operations, 10Cassandra, 06Services: Remove obsolete metrics - https://phabricator.wikimedia.org/T139792#2565335 (10Eevans) >>! In T139792#2565295, @RobH wrote: > Since this can now be filtered, would we want to remove that historical data at all anymore? (Just asking since this... [18:24:05] (03PS4) 10BBlack: www.toolserver.org: remove geoiplookup reference [puppet] - 10https://gerrit.wikimedia.org/r/305418 (https://phabricator.wikimedia.org/T100902) [18:27:48] 10Blocked-on-Operations, 06Operations, 10Cassandra, 06Services: Remove obsolete metrics - https://phabricator.wikimedia.org/T139792#2565390 (10RobH) 05Open>03declined I asked about this in IRC, and the limited feedback I received seems to align with 'don't get rid of performance data'. We may want it,... 
[18:28:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [18:30:18] (03PS3) 10Jforrester: Change default gallery mode to 'packed' on the English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301129 (https://phabricator.wikimedia.org/T141349) [18:30:20] (03PS3) 10Jforrester: Test setting gallery config differently on Beta Cluster enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301128 [18:30:57] RECOVERY - swift-account-replicator on ms-be1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [18:36:49] 06Operations, 06Release-Engineering-Team, 15User-greg, 07Wikimedia-Incident: Institute quarterly(?) review of incident reports and follow-up - https://phabricator.wikimedia.org/T141287#2565426 (10greg) >>! In T141287#2522931, @greg wrote: > * ACTION: Greg will follow up with Faidon and Kevin via email in 2... [18:38:53] (03PS1) 10Dzahn: phabricator: don't run logmail crons on inactive server [puppet] - 10https://gerrit.wikimedia.org/r/305556 [18:43:21] (03CR) 10Dzahn: [C: 032] "no-op on iridium, disables them on phab2001" [puppet] - 10https://gerrit.wikimedia.org/r/305556 (owner: 10Dzahn) [18:43:25] (03PS1) 10Dereckson: Set timezone to Europe/Ljubljana on sl. projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305558 (https://phabricator.wikimedia.org/T142701) [18:43:43] (03PS2) 10Dzahn: phabricator: don't run logmail crons on inactive server [puppet] - 10https://gerrit.wikimedia.org/r/305556 [18:44:27] (03CR) 10BBlack: [C: 032] www.toolserver.org: remove geoiplookup reference [puppet] - 10https://gerrit.wikimedia.org/r/305418 (https://phabricator.wikimedia.org/T100902) (owner: 10BBlack) [18:44:44] (03PS4) 10Smalyshev: Make Updater proper service [puppet] - 10https://gerrit.wikimedia.org/r/303626 (https://phabricator.wikimedia.org/T116754) [18:45:41] (03PS3) 10Dzahn: phabricator: don't run logmail crons on inactive server [puppet] - 10https://gerrit.wikimedia.org/r/305556 [18:49:30] !log phab2001 manually removed phabricator crons since puppet is disabled there [18:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:49:36] arrrg, again [18:50:16] (03PS3) 10Phedenskog: Enable PerformanceInspector extension for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304992 [18:51:07] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [19:00:05] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160818T1900). Please do the needful. [19:02:49] bblack: I'd like to wait a little bit longer [19:02:55] I havent' looked at the actual traffic in a while [19:03:04] and I cleaned up important uses not too long ago, see the task [19:04:09] <_joe_> . [19:10:32] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.15 [19:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:10:55] Krinkle: you mean the event.gif refs in the mobile apps? 
[19:11:20] (03PS2) 10Gilles: Update gallery image bounding box on svwiki to 150x150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304991 (https://phabricator.wikimedia.org/T113877) [19:11:55] (03CR) 10Gilles: Update gallery image bounding box on svwiki to 150x150 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304991 (https://phabricator.wikimedia.org/T113877) (owner: 10Gilles) [19:12:23] Syntax Error: Couldn't find trailer dictionary [19:15:40] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2565503 (10BBlack) >>! In T100902#2563422, @Nemo_bis wrote: >> I do see lots of legit referer headers. > > ULS uses it, for instance, and [h... [19:21:02] (03PS3) 10Dzahn: add phab1001.eqiad as CNAME for iridium.eqiad [dns] - 10https://gerrit.wikimedia.org/r/305335 [19:23:17] (03CR) 10Dzahn: [C: 032] "we're starting to use this in some places to avoid hardcoding "iridium" which will be removed at some point when it gets reinstalled. the " [dns] - 10https://gerrit.wikimedia.org/r/305335 (owner: 10Dzahn) [19:24:42] (03CR) 10Dzahn: "[radon:~] $ host phab1001.eqiad.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/305335 (owner: 10Dzahn) [19:27:32] (03CR) 10Dzahn: "thanks for the reviews, minor nitpick "done" too" [puppet] - 10https://gerrit.wikimedia.org/r/305277 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [19:27:35] (03PS6) 10Dzahn: phabricator: allow ssh between servers for cluster support [puppet] - 10https://gerrit.wikimedia.org/r/305277 (https://phabricator.wikimedia.org/T137928) [19:39:17] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/3770/" [puppet] - 10https://gerrit.wikimedia.org/r/305277 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [19:39:23] (03PS7) 10Dzahn: phabricator: allow ssh between servers for cluster support [puppet] - 10https://gerrit.wikimedia.org/r/305277 (https://phabricator.wikimedia.org/T137928) [19:39:39] https://grafana.wikimedia.org/dashboard/db/production-logging [19:45:52] !log phab2001 - reenabled puppet temp [19:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:04:41] !log Run initSiteStats maintenance script on mhrwiki and newwiki (T143352) [20:04:42] T143352: Update statistics count on mhr, new - https://phabricator.wikimedia.org/T143352 [20:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:06:55] greg-g: for new extensions, does deployment to beta happen before or after security review? [20:21:29] ori: after [20:22:32] !log phab2001 disabled phd service again [20:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:24:42] (03PS1) 10Dereckson: Restrict local upload on ar.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305573 (https://phabricator.wikimedia.org/T142450) [20:56:26] wikibugs: hi [20:57:27] hmmm is it going to reply you? [20:58:12] no, it stopped talking [21:00:35] 06Operations, 10Phabricator, 10netops: networking: allow ssh between iridium and phab2001 - https://phabricator.wikimedia.org/T143363#2565934 (10Dzahn) [21:03:22] which was gerrit port? 
[21:03:53] now it only displays the https: ulr :( [21:03:55] *url [21:04:38] Platonides: 29418 [21:04:38] ah [21:04:41] 29418 [21:05:16] I recently switched over to using https for all gerrit pushes though [21:05:33] [url "https://legoktm@gerrit.wikimedia.org/r/"] [21:05:33] insteadOf = "ssh://legoktm@gerrit.wikimedia.org:29418/" [21:05:56] is it possible to push with http? [21:06:00] *https [21:06:05] I didn't know that [21:06:31] yeah, you just have to use the special HTTPS password in your gerrit prefs [21:07:03] I don't think I have set one [21:07:20] actually, I think I haven't visited gerrit preferences in years [21:07:55] RegisteredJul 25, 2016 3:42 AM [21:08:02] pretty sure this is wrong :P [21:08:44] yep, no password [21:18:01] (03PS2) 10Gergő Tisza: Disable wgUseFilePatrol in huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275299 [21:18:40] (03PS3) 10Gergő Tisza: Remove AbuseFilter B/C config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293109 [21:18:45] (03PS4) 10Gergő Tisza: Remove $wgDisableAuthManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303939 [21:23:09] !log Run deleteEqualMessages maintenance script on eswikibooks, eswikiquote, eswikisource, eswiktionary and eswikiversity (T45917) [21:23:10] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [21:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:27:03] 06Operations, 10Phabricator: networking: allow ssh between iridium and phab2001 - https://phabricator.wikimedia.org/T143363#2566051 (10faidon) [21:30:07] PROBLEM - puppet last run on ms-be2022 is CRITICAL: CRITICAL: puppet fail [21:33:21] 06Operations, 10Phabricator: networking: allow ssh between iridium and phab2001 - https://phabricator.wikimedia.org/T143363#2565934 (10faidon) There are no ACLs between private-* subnets, irrespective of datacenters (there are very few exceptions). This isn't network-related. I logged in to debug this but som... [21:47:30] 06Operations, 06Community-Tech, 10wikidiff2, 13Patch-For-Review: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443#2566119 (10MaxSem) Yes please! Meanwhile, I've upgraded it on beta cluster. [21:49:38] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#1323201 (10Platonides) I think a new ULS version not relying on that should be released before the shutdown, then. [21:53:04] (03PS5) 10Ppchelko: ChangeProp: Update config for the new driver [puppet] - 10https://gerrit.wikimedia.org/r/305414 [21:56:10] (03CR) 10Ppchelko: "Done, made it completely backwards-compatible." [puppet] - 10https://gerrit.wikimedia.org/r/305414 (owner: 10Ppchelko) [21:56:27] RECOVERY - puppet last run on ms-be2022 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [22:02:25] (03PS1) 10Alex Monk: deployment-prep: Move udp2log to deployment-fluorine02 [puppet] - 10https://gerrit.wikimedia.org/r/305587 [22:08:41] (03PS1) 10Rush: tools: sge colletor [puppet] - 10https://gerrit.wikimedia.org/r/305588 (https://phabricator.wikimedia.org/T140999) [22:09:57] Hello, new contributor here... I was tasked with something related to this repo: https://github.com/wikimedia/puppet-cdh/blob/master/manifests/hadoop/users.pp -- I don't see a equivalent repo in Gerrit so I'm bit confused on how I am supposed to send a review? 
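The HTTPS push workflow legoktm outlines above boils down to two pieces of local git configuration. The following is a minimal sketch based on the config stanza quoted in the log, not a verbatim command from the channel: the username legoktm and the origin/refs/for/master push target are illustrative, and the credential used is the HTTPS password generated in the user's Gerrit settings, not the normal account password.

    # rewrite SSH-style Gerrit remotes (port 29418) to their HTTPS equivalents,
    # producing the [url "..."] insteadOf stanza quoted above in ~/.gitconfig
    git config --global url."https://legoktm@gerrit.wikimedia.org/r/".insteadOf "ssh://legoktm@gerrit.wikimedia.org:29418/"
    # a push for review then goes over HTTPS and prompts for the Gerrit HTTPS password
    git push origin HEAD:refs/for/master

As Platonides discovers above, an account that has never generated an HTTPS password in its Gerrit preferences will have no credential to offer, so the password has to be created there first before HTTPS pushes work.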
[22:10:00] (03PS1) 10Alex Monk: deployment-prep: Move udp2log to deployment-fluorine02 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305589 [22:11:06] 06Operations, 10Phabricator: networking: allow ssh between iridium and phab2001 - https://phabricator.wikimedia.org/T143363#2566159 (10Dzahn) >>! In T143363#2566092, @faidon wrote: > iridium-vcs.eqiad.wmnet? What is that? It pings: > It looks like an alias to iridium? > Which is it, /21 or /22? One of them is... [22:11:12] (03PS2) 10Rush: tools: sge colletor set SGE_ROOT [puppet] - 10https://gerrit.wikimedia.org/r/305588 (https://phabricator.wikimedia.org/T140999) [22:12:43] mutante: pretty sure this is the relevant task for that git-ssh/vcs thing: https://phabricator.wikimedia.org/T100519 [22:13:11] (03CR) 10Alex Monk: [C: 032] deployment-prep: Move udp2log to deployment-fluorine02 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305589 (owner: 10Alex Monk) [22:13:39] (03Merged) 10jenkins-bot: deployment-prep: Move udp2log to deployment-fluorine02 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305589 (owner: 10Alex Monk) [22:14:08] (03CR) 10Alex Monk: "cherry-picked on deployment-puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/305587 (owner: 10Alex Monk) [22:14:30] greg-g: ah! thanks [22:15:22] 06Operations, 10Phabricator: networking: allow ssh between iridium and phab2001 - https://phabricator.wikimedia.org/T143363#2566183 (10Dzahn) >>! In T143363#2566159, @Dzahn wrote: >>>! In T143363#2566092, @faidon wrote: > >> iridium-vcs.eqiad.wmnet? What is that? It pings: as Greg pointed out that is T100519 [22:16:14] !log krenair@tin Synchronized wmf-config/LabsServices.php: for labs, no-op in prod: https://gerrit.wikimedia.org/r/#/c/305589/ (duration: 00m 56s) [22:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:16:59] (03CR) 10Rush: [C: 032] tools: sge colletor set SGE_ROOT [puppet] - 10https://gerrit.wikimedia.org/r/305588 (https://phabricator.wikimedia.org/T140999) (owner: 10Rush) [22:17:03] (03PS3) 10Rush: tools: sge colletor set SGE_ROOT [puppet] - 10https://gerrit.wikimedia.org/r/305588 (https://phabricator.wikimedia.org/T140999) [22:17:08] (03CR) 10Rush: [V: 032] tools: sge colletor set SGE_ROOT [puppet] - 10https://gerrit.wikimedia.org/r/305588 (https://phabricator.wikimedia.org/T140999) (owner: 10Rush) [22:21:51] (03Abandoned) 10Dzahn: remove nobelium.eqiad.wmnet, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/304113 (https://phabricator.wikimedia.org/T142581) (owner: 10Dzahn) [22:32:38] (03PS1) 10Dzahn: phabricator: don't run phd on inactive server yet [puppet] - 10https://gerrit.wikimedia.org/r/305591 (https://phabricator.wikimedia.org/T137928) [22:33:58] (03CR) 10jenkins-bot: [V: 04-1] phabricator: don't run phd on inactive server yet [puppet] - 10https://gerrit.wikimedia.org/r/305591 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [22:36:13] (03CR) 10Dzahn: "it's alright, but need to also add v6, not in ip6tables yet" [puppet] - 10https://gerrit.wikimedia.org/r/305277 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [22:38:03] (03PS2) 10Dzahn: phabricator: don't run phd on inactive server yet [puppet] - 10https://gerrit.wikimedia.org/r/305591 (https://phabricator.wikimedia.org/T137928) [22:40:07] (03CR) 10Dzahn: [C: 032] "no-op on iridium. 
stops it on phab2001 http://puppet-compiler.wmflabs.org/3773/" [puppet] - 10https://gerrit.wikimedia.org/r/305591 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [22:40:27] (03PS3) 10Dzahn: phabricator: don't run phd on inactive server yet [puppet] - 10https://gerrit.wikimedia.org/r/305591 (https://phabricator.wikimedia.org/T137928) [22:44:07] !log phab2001 - puppet re-enabled (but phd service stopped, after gerrit 305591) [22:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:00:04] RoanKattouw, ostriches, MaxSem, awight, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160818T2300). [23:00:04] tgr: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:15:55] I can do a self-service SWAT I guess [23:25:33] (03CR) 10Gergő Tisza: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275299 (owner: 10Gergő Tisza) [23:28:01] (03PS3) 10Gergő Tisza: Disable wgUseFilePatrol in huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275299 [23:30:23] (03CR) 10EBernhardson: [C: 031] "makes sense to me." [puppet] - 10https://gerrit.wikimedia.org/r/305519 (https://phabricator.wikimedia.org/T133844) (owner: 10Gehel) [23:31:21] (03CR) 10Gergő Tisza: Disable wgUseFilePatrol in huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275299 (owner: 10Gergő Tisza) [23:31:28] (03CR) 10Gergő Tisza: [C: 032] Disable wgUseFilePatrol in huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275299 (owner: 10Gergő Tisza) [23:31:55] (03Merged) 10jenkins-bot: Disable wgUseFilePatrol in huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275299 (owner: 10Gergő Tisza) [23:36:08] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: SWAT gerrit:275299 Disable wgUseFilePatrol in huwiki (duration: 01m 00s) [23:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:38:01] (03CR) 10Gergő Tisza: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303939 (owner: 10Gergő Tisza) [23:38:13] (03PS5) 10Gergő Tisza: Remove $wgDisableAuthManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303939 [23:39:45] (03CR) 10Gergő Tisza: [C: 032] Remove $wgDisableAuthManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303939 (owner: 10Gergő Tisza) [23:40:15] (03Merged) 10jenkins-bot: Remove $wgDisableAuthManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303939 (owner: 10Gergő Tisza) [23:41:53] !log mwscript sql.php --wiki=enwikivoyage /srv/mediawiki/php/extensions/PageAssessments/db/addProjectsTable.sql [23:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:42:02] !log mwscript sql.php --wiki=enwikivoyage /srv/mediawiki/php/extensions/PageAssessments/db/addReviewsTable.sql [23:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:42:40] (03PS4) 10Dereckson: Remove AbuseFilter B/C config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293109 (owner: 10Gergő Tisza) [23:42:59] (03PS5) 10Dereckson: Remove AbuseFilter B/C config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293109 (owner: 10Gergő Tisza) [23:43:23] (03CR) 10Dereckson: "PS4: removed one of the duplicate \n at EOF" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293109 (owner: 10Gergő Tisza) [23:43:40] Hi. 
[23:43:54] !log tgr@tin Synchronized wmf-config/wikitech.php: SWAT gerrit:303939 Remove $wgDisableAuthManager (duration: 00m 49s) [23:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:45:46] !log tgr@tin Synchronized wmf-config/CommonSettings.php: SWAT gerrit:303939 Remove $wgDisableAuthManager (duration: 00m 48s) [23:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:46:56] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: SWAT gerrit:303939 Remove $wgDisableAuthManager (duration: 00m 49s) [23:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:47:36] (03CR) 10Gergő Tisza: [C: 032] Remove AbuseFilter B/C config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293109 (owner: 10Gergő Tisza) [23:48:03] (03Merged) 10jenkins-bot: Remove AbuseFilter B/C config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293109 (owner: 10Gergő Tisza) [23:49:03] tgr: could you deploy https://gerrit.wikimedia.org/r/#/c/305558/ too afterwards please (or ping me when you're done if you prefer I do it)? [23:49:36] sure [23:49:49] (03PS2) 10Gergő Tisza: Set timezone to Europe/Ljubljana on sl. projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305558 (https://phabricator.wikimedia.org/T142701) (owner: 10Dereckson) [23:50:48] Thanks. [23:52:04] !log tgr@tin Synchronized wmf-config/abusefilter.php: SWAT gerrit:293109 Remove AbuseFilter B/C config (duration: 00m 49s) [23:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:56:01] (03CR) 10Gergő Tisza: [C: 032] Set timezone to Europe/Ljubljana on sl. projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305558 (https://phabricator.wikimedia.org/T142701) (owner: 10Dereckson) [23:56:26] (03Merged) 10jenkins-bot: Set timezone to Europe/Ljubljana on sl. projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305558 (https://phabricator.wikimedia.org/T142701) (owner: 10Dereckson) [23:58:18] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: SWAT gerrit:305558 Set timezone to Europe/Ljubljana on sl. projects (duration: 00m 49s) [23:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:58:29] Dereckson: ^ [23:58:37] Thanks for the deploy. [23:59:04] 06Operations, 10ArticlePlaceholder, 10Traffic, 10Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2551996 (10DaBPunkt) >>! In T142944#2560827, @hoo wrote: > For this, we also desire to get the placeholders into search engines, to... [23:59:55] Tested, works fine. [23:59:57] (03CR) 10Faidon Liambotis: [C: 04-2] "As explained on IRC, no need for that :)" [puppet] - 10https://gerrit.wikimedia.org/r/305429 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn)