[00:03:15] RECOVERY - High lag on wdqs1004 is OK: OK: Less than 30.00% above the threshold [600.0] [00:04:15] RECOVERY - High lag on wdqs1005 is OK: OK: Less than 30.00% above the threshold [600.0] [00:11:46] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0] [00:15:05] PROBLEM - MegaRAID on tungsten is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [00:15:07] ACKNOWLEDGEMENT - MegaRAID on tungsten is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T179129 [00:15:11] 10Operations, 10ops-eqiad: Degraded RAID on tungsten - https://phabricator.wikimedia.org/T179129#3714079 (10ops-monitoring-bot) [00:23:05] PROBLEM - puppet last run on thumbor2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:24:23] 10Operations, 10Patch-For-Review: Update prod SSH key for Michael Holloway (mholloway-shell) - https://phabricator.wikimedia.org/T178897#3714096 (10Mholloway) 05Open>03Resolved Works! Thanks for merging. [00:53:05] RECOVERY - puppet last run on thumbor2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:53:58] (03PS1) 10BryanDavis: git-sync-upstream: formatting and comment tweaks [puppet] - 10https://gerrit.wikimedia.org/r/386763 [00:54:14] (03CR) 10BryanDavis: [C: 031] git-sync-upstream: rewrite in python [puppet] - 10https://gerrit.wikimedia.org/r/386318 (https://phabricator.wikimedia.org/T177944) (owner: 10Andrew Bogott) [00:56:06] (03CR) 10BryanDavis: [C: 031] "Tested together with parent patch on a Cloud VPS project local puppetmaster. Everything looks good. Local commits were preserved properly " [puppet] - 10https://gerrit.wikimedia.org/r/386331 (https://phabricator.wikimedia.org/T177944) (owner: 10Andrew Bogott) [01:04:15] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [01:04:25] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1509066258 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4021256 keys, up 4 minutes 16 seconds - replication_delay is 1509066258 [01:04:35] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6481 [01:04:55] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6480 [01:05:25] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8719233 keys, up 5 minutes 17 seconds - replication_delay is 0 [01:05:25] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4016199 keys, up 5 minutes 17 seconds - replication_delay is 0 [01:05:35] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4013785 keys, up 5 minutes 24 seconds - replication_delay is 0 [01:05:46] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4014273 keys, up 5 minutes 38 seconds - replication_delay is 0 [01:27:05] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 31.03% of data above the critical threshold [1800.0] [01:27:35] PROBLEM - High lag on wdqs1005 is CRITICAL: CRITICAL: 
37.93% of data above the critical threshold [1800.0] [01:29:35] PROBLEM - High lag on wdqs1004 is CRITICAL: CRITICAL: 34.48% of data above the critical threshold [1800.0] [02:02:35] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [02:05:35] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [02:45:04] 10Operations, 10Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#3714137 (10Dzahn) @Kaythaney I was now able to read the Wordpress ticket. I saw they confirmed they don't need a zone file and instead just need legal (MarkMonitor) to switch the DNS servers to Wordp... [02:50:15] PROBLEM - restbase endpoints health on xenon is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [02:51:06] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [02:58:15] PROBLEM - Check Varnish expiry mailbox lag on cp1067 is CRITICAL: CRITICAL: expiry mailbox lag is 2051803 [03:02:17] !log cp1067, cp4026 - backend restarts, mailbox lag [03:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:45] RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 0 [03:06:15] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [03:07:15] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [03:08:15] RECOVERY - Check Varnish expiry mailbox lag on cp1067 is OK: OK: expiry mailbox lag is 0 [03:14:15] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:16:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:26:35] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 630.70 seconds [03:27:02] (03CR) 10BBlack: [C: 04-1] "I think this needs some more to it than just adding the IP here, right? 
Like adding codfw to the sites list in the git-ssh service further" [puppet] - 10https://gerrit.wikimedia.org/r/355869 (https://phabricator.wikimedia.org/T164810) (owner: 10Dzahn) [03:37:14] (03Abandoned) 10Dzahn: LVS/phabricator: add git-ssh in codfw [puppet] - 10https://gerrit.wikimedia.org/r/355869 (https://phabricator.wikimedia.org/T164810) (owner: 10Dzahn) [03:39:55] RECOVERY - High lag on wdqs1004 is OK: OK: Less than 30.00% above the threshold [600.0] [03:39:56] RECOVERY - High lag on wdqs1005 is OK: OK: Less than 30.00% above the threshold [600.0] [03:40:32] yeah I guess by now it's not even an easy rebase heh [03:45:35] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0] [04:00:15] 10Operations: upload problem for parsoid release - https://phabricator.wikimedia.org/T179134#3714186 (10Dzahn) [04:01:45] 10Operations: upload problem for parsoid release - https://phabricator.wikimedia.org/T179134#3714200 (10Dzahn) [04:01:57] 10Operations, 10Release-Engineering-Team: upload problem for parsoid release - https://phabricator.wikimedia.org/T179134#3714186 (10Dzahn) [04:07:36] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 278.20 seconds [04:28:36] :P [04:42:36] (03PS1) 10Dzahn: releases/upload: set correct upload server as active_server [puppet] - 10https://gerrit.wikimedia.org/r/386775 (https://phabricator.wikimedia.org/T179134) [04:43:08] (03CR) 10jerkins-bot: [V: 04-1] releases/upload: set correct upload server as active_server [puppet] - 10https://gerrit.wikimedia.org/r/386775 (https://phabricator.wikimedia.org/T179134) (owner: 10Dzahn) [04:57:22] (03PS2) 10Dzahn: releases/upload: set correct upload server as active_server [puppet] - 10https://gerrit.wikimedia.org/r/386775 (https://phabricator.wikimedia.org/T179134) [04:57:53] (03CR) 10jerkins-bot: [V: 04-1] releases/upload: set correct upload server as active_server [puppet] - 10https://gerrit.wikimedia.org/r/386775 (https://phabricator.wikimedia.org/T179134) (owner: 10Dzahn) [05:02:18] (03PS3) 10Dzahn: releases/upload: set correct upload server as active_server [puppet] - 10https://gerrit.wikimedia.org/r/386775 (https://phabricator.wikimedia.org/T179134) [05:32:05] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/8493/naos.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/386775 (https://phabricator.wikimedia.org/T179134) (owner: 10Dzahn) [05:33:36] 10Operations, 10Release-Engineering-Team, 10Patch-For-Review: upload problem for parsoid release - https://phabricator.wikimedia.org/T179134#3714378 (10Dzahn) http://puppet-compiler.wmflabs.org/8493/naos.codfw.wmnet/ http://puppet-compiler.wmflabs.org/8493/tin.eqiad.wmnet/ ``` - identityfile ~/.ssh/id... 
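
Background on the mailbox-lag restarts logged at 03:02:17 above: the "Check Varnish expiry mailbox lag" alert is derived from varnishd's expiry-thread counters. A minimal way to inspect them by hand on a cache host, assuming a Varnish 4/5 varnishd exposing the standard MAIN counters (the exact command shape here is illustrative):

    # A large, growing gap between exp_mailed and exp_received is the
    # "mailbox lag" the Icinga check reports.
    varnishstat -1 -f MAIN.exp_mailed -f MAIN.exp_received

    # Approximate the lag value itself (assumes awk is available):
    varnishstat -1 -f MAIN.exp_mailed -f MAIN.exp_received | \
        awk '{v[$1]=$2} END {print v["MAIN.exp_mailed"] - v["MAIN.exp_received"]}'
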
[05:34:03] (03PS4) 10Dzahn: releases/upload: set correct upload server as active_server [puppet] - 10https://gerrit.wikimedia.org/r/386775 (https://phabricator.wikimedia.org/T179134) [05:38:14] (03PS1) 10Jayprakash12345: Change the af.wiktionary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386777 (https://phabricator.wikimedia.org/T178824) [05:38:32] (03CR) 10Dzahn: [C: 032] releases/upload: set correct upload server as active_server [puppet] - 10https://gerrit.wikimedia.org/r/386775 (https://phabricator.wikimedia.org/T179134) (owner: 10Dzahn) [05:41:57] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386778 [05:42:01] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386778 [05:43:45] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386778 (owner: 10Marostegui) [05:45:24] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386778 (owner: 10Marostegui) [05:46:24] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386778 (owner: 10Marostegui) [05:46:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1060 - T174509 (duration: 00m 50s) [05:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:48] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [05:52:03] (03Draft2) 10Jayprakash12345: Enable ShortUrl and WikiLove Extension on pa.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386779 [05:52:42] (03PS3) 10Jayprakash12345: Enable ShortUrl and WikiLove Extension on pa.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386779 (https://phabricator.wikimedia.org/T178919) [05:53:00] !log uploaded parsoid_0.8.0 to releases.wikimedia.org (for Subbu - T179134) [05:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:07] T179134: upload problem for parsoid release - https://phabricator.wikimedia.org/T179134 [05:53:52] 10Operations, 10Release-Engineering-Team, 10Patch-For-Review: upload problem for parsoid release - https://phabricator.wikimedia.org/T179134#3714448 (10Dzahn) ``` root@tin:/tmp# deb-upload /tmp/parsoid_0.8.0all_amd64.changes Trying to upload package to releases1001.eqiad.wmnet Uploading to releases1001.eqiad... [05:54:17] 10Operations, 10Release-Engineering-Team, 10Patch-For-Review: upload problem for parsoid release - https://phabricator.wikimedia.org/T179134#3714449 (10Dzahn) 05Open>03Resolved a:03Dzahn [05:55:03] !log dbstore1001: convert itwiki.pagelinks to TokuDB - T162789 [05:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:09] T162789: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789 [05:55:41] 10Operations, 10DBA, 10Patch-For-Review: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3714454 (10Marostegui) I have optimize ruwiki.recentchanges on dbstore1001. 
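
For context on the db1060 depool/repool cycle above (05:41:57 through 05:46:41): pooling state for core DB replicas lives in wmf-config/db-eqiad.php and is pushed out with scap. A rough sketch of the shape of such a change; the section key, the other hosts and all weights are invented for illustration, and only db1060 and the task number come from the log:

    // wmf-config/db-eqiad.php (illustrative excerpt)
    'sectionLoads' => [
        's5' => [
            'db1049' => 0,    // section master, weight 0 for general reads
            'db1060' => 100,  // remove or comment this line to depool the replica
            'db1071' => 200,
        ],
    ],

    // then synced from the deployment host with something like:
    //   scap sync-file wmf-config/db-eqiad.php 'Repool db1060 - T174509'
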
[05:55:48] subbu: parsoid 0.8.0 uploaded - all details in the ticket above [05:56:44] (03PS2) 10Giuseppe Lavagetto: First version of scap deployment of docker-pkg [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/386641 [05:59:44] !log Stop MySQL on db2084 to copy s5 to db2086 - T178359 [05:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:50] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [06:03:33] (03PS1) 10Marostegui: mariadb: Convert db2086 to multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/386780 (https://phabricator.wikimedia.org/T178359) [06:04:00] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Convert db2086 to multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/386780 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:04:02] (03CR) 10Dzahn: "do we really want _only_ HTTP and not HTTP_LDAP where it first checks against the HTTP pass and if that doesn't match tries LDAP?" [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [06:05:44] (03PS2) 10Marostegui: mariadb: Convert db2086 to multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/386780 (https://phabricator.wikimedia.org/T178359) [06:06:16] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Convert db2086 to multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/386780 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:07:25] gah [06:07:27] i need coffee [06:08:03] (03PS3) 10Marostegui: mariadb: Convert db2086 to multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/386780 (https://phabricator.wikimedia.org/T178359) [06:08:27] (03PS1) 10Marostegui: s5,s7.hosts: Add db2086 to s5 and s7 [software] - 10https://gerrit.wikimedia.org/r/386781 (https://phabricator.wikimedia.org/T178359) [06:08:56] (03CR) 10Jayprakash12345: "@Dereckson Is there need to create sql tables for wiki love extension. If So, then Can you create this one?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/386779 (https://phabricator.wikimedia.org/T178919) (owner: 10Jayprakash12345) [06:10:21] (03CR) 10Marostegui: [C: 032] s5,s7.hosts: Add db2086 to s5 and s7 [software] - 10https://gerrit.wikimedia.org/r/386781 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:11:10] (03Merged) 10jenkins-bot: s5,s7.hosts: Add db2086 to s5 and s7 [software] - 10https://gerrit.wikimedia.org/r/386781 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:11:43] (03CR) 10Marostegui: [C: 032] "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler02/8494/" [puppet] - 10https://gerrit.wikimedia.org/r/386780 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:16:23] (03CR) 10Dzahn: [C: 04-1] "the host name of the active gerrit server is currently cobalt" [puppet] - 10https://gerrit.wikimedia.org/r/386257 (owner: 10Paladox) [06:24:57] (03PS4) 10Giuseppe Lavagetto: utils: remove `linter` script [puppet] - 10https://gerrit.wikimedia.org/r/381935 [06:25:00] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] utils: remove `linter` script [puppet] - 10https://gerrit.wikimedia.org/r/381935 (owner: 10Giuseppe Lavagetto) [06:28:06] PROBLEM - ores on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8081: Connection refused [06:30:08] (03CR) 10Dzahn: [C: 031] "ignore my last comment - it's actually right, we want to use 208.80.154.85 (gerrit) and not 208.80.154.81 (cobalt) and we are talking abou" [puppet] - 10https://gerrit.wikimedia.org/r/386257 (owner: 10Paladox) [06:31:05] (03CR) 10Marostegui: "Thanks Anomie - I will deploy this on codfw then and leave it there for a few days" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385152 (https://phabricator.wikimedia.org/T178553) (owner: 10Marostegui) [06:40:15] RECOVERY - ores on scb1001 is OK: HTTP OK: HTTP/1.0 200 OK - 3666 bytes in 0.015 second response time [06:43:08] (03PS1) 10Giuseppe Lavagetto: ordered_yaml: use yaml instead of zaml [puppet] - 10https://gerrit.wikimedia.org/r/386783 [06:54:11] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [07:16:25] paladox: hi, do you know/remember what we have to do on https://phabricator.wikimedia.org/T166322 i don't :) [07:21:47] @seen hashar [07:21:48] mutante: Last time I saw hashar they were changing the nickname to hasharAway, but hasharAway is no longer in channel because he quit the network 11:25:30.8693330 ago. The nick change was done in #wikimedia-traffic at 10/26/2017 8:49:16 AM (22h32m31s ago) [07:22:15] (03PS2) 10Elukey: Disable HDFS httpfs daemon, we don't use it [puppet] - 10https://gerrit.wikimedia.org/r/386684 (owner: 10Ottomata) [07:23:02] paladox: nevermind, i see now. 
i took the ticket to link it to the general "remove ganglia exim stats" from everything [07:24:11] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:24:56] 10Operations, 10Cloud-Services, 10VPS-project-Phabricator, 10Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3714745 (10Dzahn) once https://gerrit.wikimedia.org/r/#/c/382916/ is merged this should resolve this ticket as well [07:28:34] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/8496/analytics1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/386684 (owner: 10Ottomata) [07:28:42] PROBLEM - Check Varnish expiry mailbox lag on cp4023 is CRITICAL: CRITICAL: expiry mailbox lag is 2030831 [07:31:42] (03PS3) 10Elukey: Allow specifying --group to sql [puppet] - 10https://gerrit.wikimedia.org/r/384951 (owner: 10Hoo man) [07:38:13] (03CR) 10Elukey: [C: 032] "This would have been part of yesterday's PUPPET SWAT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171026T1600) but di" [puppet] - 10https://gerrit.wikimedia.org/r/384951 (owner: 10Hoo man) [07:42:56] (03CR) 10Dzahn: "@filippo I don't see any exim related metrics in grafana-admin. Should this be blocked by waiting for that?" [puppet] - 10https://gerrit.wikimedia.org/r/382916 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [07:48:49] (03PS6) 10Elukey: mediawiki::maintenace::wikidata: log Wikidata dispatchers [puppet] - 10https://gerrit.wikimedia.org/r/386591 (https://phabricator.wikimedia.org/T179060) (owner: 10Hoo man) [07:55:44] (03PS7) 10Elukey: mediawiki::maintenace::wikidata: log Wikidata dispatchers [puppet] - 10https://gerrit.wikimedia.org/r/386591 (https://phabricator.wikimedia.org/T179060) (owner: 10Hoo man) [08:04:23] Did anything else happen regarding yesterday's problems? I can't even find a ticket about it [08:04:39] hoo: o/ [08:04:48] I am testing the patch for terbium [08:04:54] thanks :) [08:04:54] should be completed in a bit [08:05:06] I saw that you decoupled it from my refactoring [08:05:08] 10Operations, 10Ops-Access-Requests, 10DBA, 10cloud-services-team (Kanban): Access to raw database tables on labsdb* for wmcs-admin users - https://phabricator.wikimedia.org/T178128#3714754 (10Marostegui) Not sure I am getting your last point right, @madhuvishy, taking what  @bd808 wrote on the original t... [08:08:27] (03CR) 10Filippo Giunchedi: "> @filippo I don't see any exim related metrics in grafana-admin." [puppet] - 10https://gerrit.wikimedia.org/r/382916 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [08:08:50] hoo: yes sorry probably yesterday evening I didn't explain myself in a good way.. I wanted only that the /var/log/wikidata dir was ensured, not file { $dispatch_log_file: etc.. [08:11:08] !log Drop redundant indexes from pagelinks and templatelinks on s3 wikis only for dbstore2001 and dbstore2002 - T174509 [08:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:15] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [08:15:20] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/8498/terbium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/386591 (https://phabricator.wikimedia.org/T179060) (owner: 10Hoo man) [08:19:07] hoo: crontab deployed on terbium :) [08:19:37] thanks a lot! 
:) [08:20:42] :) [08:23:16] elukey: https://gerrit.wikimedia.org/r/#/c/386591/6..7/modules/mediawiki/manifests/maintenance/wikidata.pp oh… that kill substitution for the initial $$ (but not the final one) [08:25:50] * elukey feels stupid [08:26:06] hahahah see what happens when you don't drink enough coffee :D [08:26:11] fixing it [08:31:47] (03PS1) 10Elukey: mediawiki::maintenance::wikidata: fix variable substitution [puppet] - 10https://gerrit.wikimedia.org/r/386785 [08:31:59] hoo: --^ [08:32:57] Is it ok to not quote the dollars in there? [08:33:20] (Not sure how exactly puppet determines what's a var and what not) [08:33:49] running the compiler to figure out what's best [08:34:53] so the last one is rendered like echo "$$: Dispatcher exited with $?" in the crontab [08:34:59] so it is surely good [08:35:12] curious about my version [08:35:13] Yeah, checked that as well [08:36:08] seems fine as well https://puppet-compiler.wmflabs.org/compiler02/8500/terbium.eqiad.wmnet/ [08:37:08] (03PS2) 10Elukey: mediawiki::maintenance::wikidata: fix variable substitution [puppet] - 10https://gerrit.wikimedia.org/r/386785 [08:41:28] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/8501/terbium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/386785 (owner: 10Elukey) [08:43:51] 10Operations, 10Services (watching), 10User-fgiunchedi: New upstream jvm-tools - https://phabricator.wikimedia.org/T178839#3704693 (10fgiunchedi) Indeed the source was present only as `.dsc`, I've imported it into https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/jvm-tools [08:43:56] hoo: should be ok now [08:44:36] crontab looks good… let's wait for the logs [08:45:08] 20299: Starting dispatcher [08:45:31] Thanks so much :) This is going to be really helpful (and if only to rule out some problem causes) [08:46:03] super :) [08:55:51] 10Operations, 10DBA, 10Patch-For-Review: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3714894 (10Marostegui) itwiki.pagelinks has been converted to TokuDB. [08:56:21] !log Optimize table commonswiki.recentchanges on dbstore1001 - T162789 [08:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:28] T162789: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789 [09:01:18] (03CR) 10Ema: [C: 032] VCL: Exp cache admission policy for varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/386192 (https://phabricator.wikimedia.org/T144187) (owner: 10Ema) [09:01:23] (03PS15) 10Ema: VCL: Exp cache admission policy for varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/386192 (https://phabricator.wikimedia.org/T144187) [09:01:25] (03CR) 10Ema: [V: 032 C: 032] VCL: Exp cache admission policy for varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/386192 (https://phabricator.wikimedia.org/T144187) (owner: 10Ema) [09:19:15] (03CR) 10Hashar: "I don't think we need the host_aliases at all. 
It has been introduced via https://gerrit.wikimedia.org/r/#/c/281706/ but really I don't t" [puppet] - 10https://gerrit.wikimedia.org/r/386257 (owner: 10Paladox) [09:30:58] (03CR) 10Dzahn: [C: 031] "agree, we can just drop the alias line, probably should to get rid of cobalt references, especially if we make the gerrit ssh only listen " [puppet] - 10https://gerrit.wikimedia.org/r/386257 (owner: 10Paladox) [09:37:12] (03PS1) 10Gehel: wdqs: remove PrintPLAB from GC logging [puppet] - 10https://gerrit.wikimedia.org/r/386791 (https://phabricator.wikimedia.org/T175919) [09:40:22] PROBLEM - HHVM rendering on mw2127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:41:12] RECOVERY - HHVM rendering on mw2127 is OK: HTTP OK: HTTP/1.1 200 OK - 83297 bytes in 0.328 second response time [09:53:30] (03PS1) 10Dzahn: gerrit: sshd listen on specific IP, v4 and v6 [puppet] - 10https://gerrit.wikimedia.org/r/386793 [09:57:59] (03PS2) 10Dzahn: gerrit: sshd listen on specific IP, v4 and v6 [puppet] - 10https://gerrit.wikimedia.org/r/386793 [10:06:15] (03CR) 10Hashar: [C: 031] "Seems good. Might want to pass it through the puppet compiler?" [puppet] - 10https://gerrit.wikimedia.org/r/386793 (owner: 10Dzahn) [10:15:51] (03PS1) 10Volans: Cumin: fix wmf-style violations [puppet] - 10https://gerrit.wikimedia.org/r/386799 [10:15:58] (03CR) 10Dzahn: [C: 031] "http://puppet-compiler.wmflabs.org/8502/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/386793 (owner: 10Dzahn) [10:21:33] (03CR) 10Volans: "Compiler looks sane to me:" [puppet] - 10https://gerrit.wikimedia.org/r/386799 (owner: 10Volans) [10:24:12] (03CR) 10Paladox: gerrit: sshd listen on specific IP, v4 and v6 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/386793 (owner: 10Dzahn) [10:24:15] (03CR) 10Paladox: [C: 04-1] gerrit: sshd listen on specific IP, v4 and v6 [puppet] - 10https://gerrit.wikimedia.org/r/386793 (owner: 10Dzahn) [10:25:41] (03CR) 10Dzahn: [C: 031] gerrit: sshd listen on specific IP, v4 and v6 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/386793 (owner: 10Dzahn) [10:34:40] (03CR) 10Filippo Giunchedi: [C: 031] uwsgi: Use systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/386620 (owner: 10Muehlenhoff) [10:41:03] (03CR) 10Jayprakash12345: "And also for ShortUrl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386779 (https://phabricator.wikimedia.org/T178919) (owner: 10Jayprakash12345) [10:44:09] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team, and 2 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715032 (10hoo) [10:44:21] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team, and 2 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715052 (10hoo) [10:47:06] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team, and 3 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715061 (10Ladsgroup) [10:56:38] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-Joe, 10User-fgiunchedi: Improve monitoring of the Kubernetes clusters - https://phabricator.wikimedia.org/T177395#3715066 (10akosiaris) [10:59:46] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-Joe, 10User-fgiunchedi: Improve monitoring of the Kubernetes clusters - https://phabricator.wikimedia.org/T177395#3715073 (10akosiaris) p:05Triage>03Normal Looks like a 
valid path forward is to expose the 3 /metrics endpoints, that is the API serv... [11:03:33] !log mobrovac@tin Started deploy [citoid/deploy@ff63420]: Update dependencies [11:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:30] !log Optimize cebwiki.templatelinks on db1103 - T174509 [11:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:37] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [11:06:52] !log mobrovac@tin Finished deploy [citoid/deploy@ff63420]: Update dependencies (duration: 03m 18s) [11:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:50] (03CR) 10Paladox: [C: 04-1] gerrit: sshd listen on specific IP, v4 and v6 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/386793 (owner: 10Dzahn) [11:24:23] (03PS3) 10Dzahn: gerrit: sshd listen on specific IP, v4 and v6 [puppet] - 10https://gerrit.wikimedia.org/r/386793 [11:27:31] (03CR) 10Paladox: [C: 031] gerrit: sshd listen on specific IP, v4 and v6 [puppet] - 10https://gerrit.wikimedia.org/r/386793 (owner: 10Dzahn) [11:29:26] (03PS6) 10Rush: openstack: refactor deployment specific puppetmaster code [puppet] - 10https://gerrit.wikimedia.org/r/386612 (https://phabricator.wikimedia.org/T171494) [11:32:20] jouncebot: next [11:32:20] In 73 hour(s) and 27 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171030T1300) [11:33:26] chasemp: i would like to sneak in a Gerrit restart, i see you are in the middle of rebasing, don't want to snipe you [11:33:40] mutante: go for it [11:33:56] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team, and 3 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715115 (10Ladsgroup) @hoo and I sampled x-cache from wberequest dataset for the hour of 19 yesterday for two... [11:36:24] !log cp4023: varnish-backend-restart for lag [11:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:16] (03CR) 10Dzahn: [C: 032] gerrit: sshd listen on specific IP, v4 and v6 [puppet] - 10https://gerrit.wikimedia.org/r/386793 (owner: 10Dzahn) [11:38:22] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/8504/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/386793 (owner: 10Dzahn) [11:38:27] (03PS4) 10Dzahn: gerrit: sshd listen on specific IP, v4 and v6 [puppet] - 10https://gerrit.wikimedia.org/r/386793 [11:38:27] chasemp: thanks, on it [11:38:42] RECOVERY - Check Varnish expiry mailbox lag on cp4023 is OK: OK: expiry mailbox lag is 0 [11:39:52] !log gerrit restart to apply gerrit:386793 is imminent [11:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:29] !log gerrit back - maintenance over [11:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:40] paladox: ^ all good? 
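
The restart logged at 11:39:52 applies gerrit:386793, which makes the Gerrit SSH daemon bind to specific v4/v6 addresses instead of the wildcard. In gerrit.config terms that is the sshd.listenAddress option (multiple entries are allowed, IPv6 in brackets); a hedged sketch, with the file path and the v6 address left as placeholders:

    # etc/gerrit.config on the Gerrit host (path illustrative)
    [sshd]
        listenAddress = 208.80.154.85:29418
        listenAddress = [<gerrit-ipv6-address>]:29418

    # Quick check after the restart (compare with the netstat lines pasted below):
    #   ss -tln | grep 29418
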
[11:42:02] ssh works [11:42:06] now we need to check ci [11:42:13] tcp6 0 0 2620:0:861:3:208::29418 :::* [11:42:19] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/386793 (owner: 10Dzahn) [11:42:22] tcp6 0 0 208.80.154.85:29418 :::* [11:42:32] https://integration.wikimedia.org/zuul/ [11:42:34] yep it works [11:42:42] as configured :) [11:42:43] yay [11:42:47] :) [11:43:10] chasemp: all clear :) [11:43:22] (03PS1) 10Marostegui: db-codfw.php: Pool db2084 as multi-instance host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386810 (https://phabricator.wikimedia.org/T178553) [11:48:33] mutante: so changed ? :) [11:48:43] hashar yep [11:48:46] and ci works now [11:48:49] \o/ [11:49:22] and from my connection as well [11:49:54] yep. [11:50:06] hashar: :) yep [11:51:49] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team, and 3 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715170 (10hoo) This has some potentially interesting patterns: `watchlist, recentchanges, contributions, log... [11:54:43] (03PS2) 10Hashar: Migrate contint::firewall to a profile [puppet] - 10https://gerrit.wikimedia.org/r/385472 [11:59:21] (03CR) 10Zoranzoki21: [C: 031] Change the af.wiktionary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386777 (https://phabricator.wikimedia.org/T178824) (owner: 10Jayprakash12345) [12:00:23] 10Operations, 10Pybal, 10Traffic, 10netops, 10Patch-For-Review: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3715204 (10elukey) Found a chrome bug that seems interesting (although not really recent): https://bugs.chromium.org/p/chromium/issues/detail?id=11... [12:01:56] (03PS1) 10Hashar: Drop unused role::ci::jenkins_access [puppet] - 10https://gerrit.wikimedia.org/r/386811 [12:03:15] !log execute systemctl reset-failed kafka-mirror-main-eqiad_to_jumbo-eqiad.service on kafka-jumbo hosts (old unit not deployed anymore) [12:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:29] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team, and 3 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715032 (10Marostegui) From those two masters's (s4 and s5) graphs, we can see that whatever happened, happene... [12:08:17] (03PS1) 10Hashar: contint: migrate castor to a profile [puppet] - 10https://gerrit.wikimedia.org/r/386812 [12:08:42] (03CR) 10Hashar: "solely for labs, specially castor02.integration.eqiad.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/386812 (owner: 10Hashar) [12:08:45] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team, and 3 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715242 (10thiemowmde) p:05Triage>03High The tasks description talks about ongoing investigation. Is there... [12:10:00] 10Operations, 10DBA, 10Patch-For-Review: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3715248 (10Marostegui) >>! In T162789#3699022, @jcrespo wrote: > On screen at dbstore1001 CC @marostegui : > > root@dbstore1001[cebwiki]> ALTER TABLE t... 
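
The dbstore1001 work referenced just above (TokuDB conversion and table optimization for T162789) comes down to long-running statements of roughly this shape; the table names mirror the !log entries, while the surrounding details (screen session, socket, timing) are omitted:

    -- Convert a large table to TokuDB to reduce the size/overhead of the bacula dumps
    ALTER TABLE itwiki.pagelinks ENGINE=TokuDB;

    -- Rebuild a table after heavy churn or after dropping redundant indexes
    OPTIMIZE TABLE commonswiki.recentchanges;
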
[12:10:12] !log Optimize commonswiki.templatelinks on dbstore1001 - T162789 [12:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:19] T162789: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789 [12:12:30] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team, and 3 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715253 (10hoo) p:05High>03Triage >>! In T179156#3715242, @thiemowmde wrote: > The tasks description talks... [12:14:56] (03PS1) 10Hashar: contint: migrate publisher to a profile [puppet] - 10https://gerrit.wikimedia.org/r/386813 [12:15:17] (03PS7) 10Rush: openstack: refactor deployment specific puppetmaster code [puppet] - 10https://gerrit.wikimedia.org/r/386612 (https://phabricator.wikimedia.org/T171494) [12:15:32] (03CR) 10Hashar: "That is for integration-publishing.integration.eqiad.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/386813 (owner: 10Hashar) [12:15:47] hoo: I think it was the wikidata thing. either that or it spuriously went away on its own <1h after the wikidata rollback [12:15:55] but, yes, evidence is scant [12:26:08] bblack: Hm… so no indication of the underlying cause at all :( [12:26:27] I'll look at the full Wikidata wmf4 -> wmf5 diff then and see whether there's scary stuff there [12:26:32] but it could be anything really [12:26:41] maybe just some user doing crazy things [12:26:51] yes, all things are possible [12:27:19] but really, I don't know why we didn't look at / revert those changes at the beggining. they line up so well with the initial fatals starting, etc... [12:28:12] and the final cache backend restart wasn't going to change anything on its own, we'd done that a billion times during those hours [12:30:41] I think the wikidata revert solved the underlying primary issue, but the currently-affected cache was still backlogged on conns/reqs and had mailbox lag, etc... it took one more cache move to get past it afterwards. [12:31:00] (or waiting long might've worked too, but who knows) [12:31:06] s/long/longer/ [12:32:00] anyways, I'll move to the task :) [12:34:33] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team, and 3 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715319 (10BBlack) Copying this in from etherpad (this is less awful than 6 hours of raw IRC+SAL logs, but sti... [12:35:17] 10Operations, 10JobRunner-Service, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10User-Joe: Wikibase: Increase batch size for HTMLCacheUpdateJobs triggered by repo changes. - https://phabricator.wikimedia.org/T178810#3715324 (10thiemowmde) [12:35:29] 10Operations, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3715323 (10thiemowmde) [12:35:34] 10Operations, 10JobRunner-Service, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10User-Joe: Wikibase: Increase batch size for HTMLCacheUpdateJobs triggered by repo changes. - https://phabricator.wikimedia.org/T178810#3703558 (10thiemowmde) p:05Triage>03Low [12:35:37] (03PS1) 10Hashar: contint: pass settings to labs Docker daemon [puppet] - 10https://gerrit.wikimedia.org/r/386819 (https://phabricator.wikimedia.org/T179160) [12:37:08] (03CR) 10Hashar: [V: 031 C: 031] "Applied on the CI puppet master." 
[puppet] - 10https://gerrit.wikimedia.org/r/386819 (https://phabricator.wikimedia.org/T179160) (owner: 10Hashar) [12:37:29] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team, and 3 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715335 (10BBlack) My gut instinct remains what it was at the end of the log above. I think something in the... [12:37:51] akosiaris: I could use a dummy profile::ci::docker::settings for labs :] That was introduced for the CI master a few days ago and ended up breaking puppet on labs. https://gerrit.wikimedia.org/r/#/c/386819/ [12:41:20] 10Operations, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3715336 (10thiemowmde) [12:42:53] hashar: aha... forgot about the labs hosts running that profile [12:43:16] (03CR) 10Alexandros Kosiaris: [C: 032] contint: pass settings to labs Docker daemon [puppet] - 10https://gerrit.wikimedia.org/r/386819 (https://phabricator.wikimedia.org/T179160) (owner: 10Hashar) [12:43:49] done [12:44:17] (03PS22) 10Paladox: gerrit: Ajust scap files (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 [12:44:19] (03PS20) 10Paladox: Gerrit: Upgrading gerrit to 2.14.6-pre (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363734 [12:47:49] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team, and 3 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715381 (10BBlack) Unless anyone objects, I'd like to start with reverting our emergency `varnish max_connecti... [12:49:25] (03PS1) 10BBlack: Revert "cache_text: raise MW connection limits to 10K" [puppet] - 10https://gerrit.wikimedia.org/r/386824 (https://phabricator.wikimedia.org/T179156) [12:52:51] akosiaris: wonderful :) [12:56:30] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715432 (10hoo) I think I found the root cuase now, seems it's actually related to the WikibaseQualityConstra... [12:56:42] bblack: Think we got the root cause now [12:59:09] We could revert just the extension to wmf4 and put everything back into the pre-issue state? [12:59:23] What do you think? Doable? Or overly dangerous? [12:59:34] not sure if I would do that on a Friday [13:00:07] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715446 (10BBlack) >>! In T179156#3715432, @hoo wrote: > I think I found the root cuase now, seems it's actu... [13:00:37] (03PS8) 10Rush: openstack: refactor deployment specific puppetmaster code [puppet] - 10https://gerrit.wikimedia.org/r/386612 (https://phabricator.wikimedia.org/T171494) [13:00:54] (03PS2) 10Giuseppe Lavagetto: ordered_yaml: use yaml instead of zaml [puppet] - 10https://gerrit.wikimedia.org/r/386783 (https://phabricator.wikimedia.org/T179076) [13:01:21] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715447 (10hoo) >>! In T179156#3715446, @BBlack wrote: > >>>! In T179156#3715432, @hoo wrote: >> I think I fo... 
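
For readers unfamiliar with the knob being reverted at 12:49:25: per-backend connection limits in Varnish are the .max_connections attribute of a backend definition (in production the value is templated by puppet rather than edited in VCL directly). A minimal sketch with purely illustrative host and numbers:

    backend appserver {
        .host = "appservers.svc.eqiad.wmnet";   # illustrative backend host
        .port = "80";
        # the emergency bump raised this to 10000; the revert returns it to the normal cap
        .max_connections = 1000;
    }
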
[13:02:05] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/8508" [puppet] - 10https://gerrit.wikimedia.org/r/386783 (https://phabricator.wikimedia.org/T179076) (owner: 10Giuseppe Lavagetto) [13:02:29] (03PS9) 10Rush: openstack: refactor deployment specific puppetmaster code [puppet] - 10https://gerrit.wikimedia.org/r/386612 (https://phabricator.wikimedia.org/T171494) [13:03:01] well, normally I would try to do nothing scary on a Friday [13:03:04] 10Operations, 10Puppet, 10User-Joe: puppet4: Could not find declared class mariadb::groups at /etc/puppet/modules/role/manifests/mariadb/dbstore_multiinstance.pp:16:5 - https://phabricator.wikimedia.org/T179161#3715453 (10herron) [13:03:14] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3652273 (10herron) [13:03:20] but I think we have sufficient unknowness going on here that it's worth unreverted some things to narrow down the cause on a Friday [13:03:42] that is a good point indeed [13:03:44] I think my sense of time must wake up later than the rest of me [13:03:56] it's always my tenses that are wrong in these early sentences lol [13:04:06] my only point is, are we stable now? [13:04:19] (03PS3) 10Elukey: role::mediawiki::jobrunner: inc runners for refreshLinks/hmtlCacheUpdate [puppet] - 10https://gerrit.wikimedia.org/r/386636 (https://phabricator.wikimedia.org/T173710) [13:04:35] we seem to be stable now, and we think we know which of the 4-5 reverts yesterday caused the stability [13:05:12] you guys know more than me ofc, my point is, if we revert things, can we cause instability again and/or during the weekend? [13:05:12] the rest of those code reverts, may cause more harm than good if left in reverted state, or at best be annoyingly+pointlessly reverted [13:05:21] ah, i see [13:05:43] 10Operations, 10Puppet, 10User-Joe: puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/eventlogging/analytics/server.pp:27:29 - https://phabricator.wikimedia.org/T179162#3715478 (10herron) [13:05:46] (03PS4) 10Elukey: role::mediawiki::jobrunner: inc runners for refreshLinks/hmtlCacheUpdate [puppet] - 10https://gerrit.wikimedia.org/r/386636 (https://phabricator.wikimedia.org/T173710) [13:05:49] I couldn't say, I don't know the reverted code or its status well enough [13:06:05] but I think if we re-trigger the problem during the day today, it should be pretty obvious [13:06:32] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3652273 (10herron) [13:06:55] [well, with the caveat that the varnish hack revert needs to happen first, to make sure it's obvious] [13:07:31] maybe wait for greg-g to get his coffee and chime in on directions here? [13:08:30] 10Operations, 10Puppet, 10User-Joe: Error while evaluating a Function Puppet4: Call, hiera() can only be called using the 4.x function API. 
See Scope#call_function at /etc/puppet/modules/role/manifests/graphite/alerts.pp:12:21 - https://phabricator.wikimedia.org/T179163#3715495 (10herron) [13:08:32] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3652273 (10herron) [13:09:22] PROBLEM - Nginx local proxy to apache on mw2216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:12] RECOVERY - Nginx local proxy to apache on mw2216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.191 second response time [13:10:45] 10Operations, 10Puppet, 10User-Joe: puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/webperf.pp:17:22 - https://phabricator.wikimedia.org/T179164#3715516 (10herron) [13:10:50] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3652273 (10herron) [13:11:12] (03CR) 1020after4: [C: 031] "@thcipriani: The revised version of D774 was merged, so this is unblocked now, presumably?" [puppet] - 10https://gerrit.wikimedia.org/r/380503 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris) [13:12:18] 10Operations, 10Puppet, 10User-Joe: puppet4: Error while evaluating a Resource Statement, Unknown resource type: 'site_instances' at /etc/puppet/modules/ganglia/manifests/monitor/aggregator.pp:36:5 - https://phabricator.wikimedia.org/T179165#3715539 (10herron) [13:12:25] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3652273 (10herron) [13:13:59] 10Operations, 10Puppet, 10User-Joe: puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/kafka/main/broker.pp:14:23 - https://phabricator.wikimedia.org/T179166#3715559 (10herron) [13:14:10] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3652273 (10herron) [13:15:57] 10Operations, 10Puppet, 10User-Joe: puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/kafka/main/broker.pp:14:23 - https://phabricator.wikimedia.org/T179166#3715559 (10herron) [13:16:32] 10Operations, 10Puppet, 10User-Joe: puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/kafka/main/broker.pp:14:23 - https://phabricator.wikimedia.org/T179166#3715585 (10herron) [13:16:54] (03PS1) 10Dzahn: cyberbot::db: ensure mariadb service running, add service [puppet] - 10https://gerrit.wikimedia.org/r/386831 [13:17:13] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. 
See Scope#call_function at /etc/puppet/modules/profile/manifests/kafka/broker.pp:92:23 - https://phabricator.wikimedia.org/T179167#3715587 (10herron) [13:17:29] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3652284 (10herron) [13:18:39] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/kafka/analytics/burrow.pp:7:15 - https://phabricator.wikimedia.org/T179168#3715606 (10herron) [13:18:49] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3652284 (10herron) [13:18:59] (03PS5) 10Elukey: role::mediawiki::jobrunner: inc runners for refreshLinks/hmtlCacheUpdate [puppet] - 10https://gerrit.wikimedia.org/r/386636 (https://phabricator.wikimedia.org/T173710) [13:19:51] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Resource Statement, Unknown resource type: 'exim_alias_file' at /etc/puppet/private/modules/privateexim/manifests/init.pp:55:2 - https://phabricator.wikimedia.org/T179170#3715634 (10herron) [13:19:56] (03PS6) 10Elukey: role::mediawiki::jobrunner: inc runners for refreshLinks/hmtlCacheUpdate [puppet] - 10https://gerrit.wikimedia.org/r/386636 (https://phabricator.wikimedia.org/T173710) [13:21:58] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, Failed to parse inline template: hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/puppetmaster/puppetdb.pp... - https://phabricator.wikimedia.org/T179171#3715674 [13:22:53] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/logging/kafkatee/webrequest/base.pp:11:21 - https://phabricator.wikimedia.org/T179172#3715688 (10herron) [13:24:07] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, Failed to parse template elasticsearch/elasticsearch_5.yml.erb:\n Filepath: /etc/puppet/modules/elasticsearch/templates/elasticsearch_5.yml.erb\n Line: 447\n Detail: undef... - https://phabricator.wikimedia.org/T179174#3715715 [13:24:09] (03PS1) 10Hoo man: Disable constraints check with SPARQL for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386833 (https://phabricator.wikimedia.org/T179156) [13:24:18] bblack: ^ This should fix the problems at hand [13:24:30] and then we can go back to pre-problem state just fine [13:24:39] just disabling a minor feature [13:25:18] (03CR) 10jerkins-bot: [V: 04-1] Disable constraints check with SPARQL for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386833 (https://phabricator.wikimedia.org/T179156) (owner: 10Hoo man) [13:25:33] phpcs… :S [13:25:44] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. 
See Scope#call_function at /etc/puppet/modules/profile/manifests/pmacct.pp:19:26 - https://phabricator.wikimedia.org/T179175#3715730 (10herron) [13:25:47] (03PS2) 10Hoo man: Disable constraints check with SPARQL for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386833 (https://phabricator.wikimedia.org/T179156) [13:26:57] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, Failed to parse template openldap/labs-acls.erb:\n Filepath: /usr/lib/ruby/vendor_ruby/puppet/parser/functions.rb\n Line: 275\n Detail: hiera() can only be called using th... - https://phabricator.wikimedia.org/T179176#3715745 [13:27:07] (03CR) 10Giuseppe Lavagetto: [C: 04-1] role::mediawiki::jobrunner: inc runners for refreshLinks/hmtlCacheUpdate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/386636 (https://phabricator.wikimedia.org/T173710) (owner: 10Elukey) [13:27:59] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Resource Statement, Unknown resource type: 'cronjob' at /etc/puppet/modules/mediawiki/manifests/maintenance/refreshlinks.pp:14:5 - https://phabricator.wikimedia.org/T179177#3715761 (10herron) [13:29:02] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, Failed to parse template thumbor/nginx.conf.erb:\n Filepath: /usr/lib/ruby/vendor_ruby/puppet/parser/functions.rb\n Line: 275\n Detail: hiera() can only be called using th... - https://phabricator.wikimedia.org/T179178#3715777 [13:35:57] 10Operations, 10Puppet, 10User-Joe: Puppet4: hiera() can only be called using the 4.x function API. - https://phabricator.wikimedia.org/T179181#3715841 (10herron) [13:36:49] (03CR) 10Paladox: cyberbot::db: ensure mariadb service running, add service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/386831 (owner: 10Dzahn) [13:38:19] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/ganglia/config.pp:4:21 - https://phabricator.wikimedia.org/T179180#3715881 (10herron) [13:38:21] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, Failed to parse template thumbor/nginx.conf.erb:\n Filepath: /usr/lib/ruby/vendor_ruby/puppet/parser/functions.rb\n Line: 275\n Detail: hiera() can only be called using th... - https://phabricator.wikimedia.org/T179178#3715882 [13:38:24] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/profile/manifests/pmacct.pp:19:26 - https://phabricator.wikimedia.org/T179175#3715884 (10herron) [13:38:25] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, Failed to parse template openldap/labs-acls.erb:\n Filepath: /usr/lib/ruby/vendor_ruby/puppet/parser/functions.rb\n Line: 275\n Detail: hiera() can only be called using th... - https://phabricator.wikimedia.org/T179176#3715883 [13:38:28] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, Failed to parse inline template: hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/puppetmaster/puppetdb.pp... - https://phabricator.wikimedia.org/T179171#3715886 [13:38:29] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. 
See Scope#call_function at /etc/puppet/modules/role/manifests/logging/kafkatee/webrequest/base.pp:11:21 - https://phabricator.wikimedia.org/T179172#3715885 (10herron) [13:38:33] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/profile/manifests/kafka/broker.pp:92:23 - https://phabricator.wikimedia.org/T179167#3715888 (10herron) [13:38:35] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/kafka/analytics/burrow.pp:7:15 - https://phabricator.wikimedia.org/T179168#3715887 (10herron) [13:38:36] 10Operations, 10Puppet, 10User-Joe: puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/webperf.pp:17:22 - https://phabricator.wikimedia.org/T179164#3715890 (10herron) [13:38:39] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Puppet4: Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/graphite/alerts.pp:12:21 - https://phabricator.wikimedia.org/T179163#3715891 (10herron) [13:38:40] 10Operations, 10Puppet, 10User-Joe: puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/kafka/main/broker.pp:14:23 - https://phabricator.wikimedia.org/T179166#3715889 (10herron) [13:38:42] 10Operations, 10Puppet, 10User-Joe: Puppet4: hiera() can only be called using the 4.x function API. - https://phabricator.wikimedia.org/T179181#3715880 (10herron) [13:38:44] 10Operations, 10Puppet, 10User-Joe: puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/eventlogging/analytics/server.pp:27:29 - https://phabricator.wikimedia.org/T179162#3715892 (10herron) [13:38:50] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Puppet4: Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/graphite/alerts.pp:12:21 - https://phabricator.wikimedia.org/T179163#3715495 (10herron) [13:39:36] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/graphite/alerts.pp:12:21 - https://phabricator.wikimedia.org/T179163#3715495 (10herron) [13:39:51] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/profile/manifests/kafka/broker.pp:92:23 - https://phabricator.wikimedia.org/T179167#3715587 (10herron) [13:40:43] bblack: greg-g: Ok to just disable the troublesome feature? That should unblock us here [13:40:51] and leave very little user facing harm [13:41:11] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[git_pull_wmde/toolkit-analyzer-build],Exec[cdh::hadoop::directory /user/spark] [13:41:34] (some constraints on Wikidata will go away, but that's only visible for a few hours of a gadget) [13:41:38] So hardly anyone should mind [13:41:41] (03PS1) 10BBlack: tlsproxy+numa: only lock memory under "isolate" [puppet] - 10https://gerrit.wikimedia.org/r/386836 [13:42:34] hoo: I don't feel like I have any valid input to offer at this level, I really don't understand the risks or the code or the structure :) [13:42:54] hm… who would be the right person right now? [13:43:02] the only thing I feel strongly about, is we revert the varnish max_conns=10K before trying anything else related to the problem [13:43:20] greg-g would be best, but it's still early over there [13:43:46] :/ [13:45:53] marostegui, looking at https://phabricator.wikimedia.org/T173511#3713170, I have a workshop coming up soon and I'd like to have that dataset in the newly proposed "datasets_p" DB available in quarry soon. Is that possible before Nov 8th or should I figure out something else for the workshop? [13:46:47] (03CR) 10BBlack: [C: 032] tlsproxy+numa: only lock memory under "isolate" [puppet] - 10https://gerrit.wikimedia.org/r/386836 (owner: 10BBlack) [13:47:05] I'll file a task for this specifically so you and jynus can prioritize. [13:47:19] (03CR) 10Dzahn: cyberbot::db: ensure mariadb service running, add service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/386831 (owner: 10Dzahn) [13:47:59] (03CR) 10Dzahn: cyberbot::db: ensure mariadb service running, add service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/386831 (owner: 10Dzahn) [13:49:51] !log restarting nginx on cp4021, without NUMA memory constraints [13:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:23] (03PS7) 10Elukey: role::mediawiki::jobrunner: inc runners for refreshLinks/hmtlCacheUpdate [puppet] - 10https://gerrit.wikimedia.org/r/386636 (https://phabricator.wikimedia.org/T173710) [13:51:35] stat1005 is basically OOM and will probably start killing stuff… *sigh* [13:51:48] elukey: typo there :) [13:52:00] elukey: s/hmtl/html/ [13:52:26] !log reboot cp4021 to clean up oom messes [13:52:29] (I haven't even clicked on the link, just saw the first line of the commit message here and that stood out :) [13:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:37] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "If the regex cache is the problem, I would suggest" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386833 (https://phabricator.wikimedia.org/T179156) (owner: 10Hoo man) [13:53:08] 10Operations, 10Puppet, 10User-Joe: Puppet4: hiera() can only be called using the 4.x function API. - https://phabricator.wikimedia.org/T179181#3715957 (10herron) [13:53:10] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API - https://phabricator.wikimedia.org/T179077#3715956 (10herron) [13:53:19] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. 
See Scope#call_function at /etc/puppet/modules/role/manifests/analytics_cluster/refinery/job/camus.pp:8:21 - https://phabricator.wikimedia.org/T179077#3712540 (10herron) [13:54:15] paravoid: thanksssss [13:54:53] 10Operations, 10Puppet, 10User-Joe, 10cloud-services-team (FY2017-18): Puppet4: Error while evaluating a Function Call, Failed to parse template openstack2/mitaka/horizon/local_settings.py.erb:\n Filepath: /usr/lib/ruby/vendor_ruby/puppet/parser/functions.... - https://phabricator.wikimedia.org/T179086#3715961 [13:55:02] (03PS8) 10Elukey: role::mediawiki::jobrunner: inc runners for refreshLinks/htmlCacheUpdate [puppet] - 10https://gerrit.wikimedia.org/r/386636 (https://phabricator.wikimedia.org/T173710) [13:55:07] 10Operations, 10Puppet, 10User-Joe: Puppet4: hiera() can only be called using the 4.x function API. - https://phabricator.wikimedia.org/T179181#3715841 (10herron) [13:55:09] 10Operations, 10Puppet, 10User-Joe, 10cloud-services-team (FY2017-18): Puppet4: Error while evaluating a Function Call, Failed to parse template openstack2/mitaka/horizon/local_settings.py.erb:\n Filepath: /usr/lib/ruby/vendor_ruby/puppet/parser/functions.... - https://phabricator.wikimedia.org/T179086#3712883 [14:03:24] (03CR) 10Paladox: [C: 031] cyberbot::db: ensure mariadb service running, add service [puppet] - 10https://gerrit.wikimedia.org/r/386831 (owner: 10Dzahn) [14:07:50] (03PS2) 10Dzahn: cyberbot::db: ensure mariadb service running, add service [puppet] - 10https://gerrit.wikimedia.org/r/386831 [14:08:30] (03CR) 10Dzahn: [C: 032] cyberbot::db: ensure mariadb service running, add service [puppet] - 10https://gerrit.wikimedia.org/r/386831 (owner: 10Dzahn) [14:09:15] 10Operations, 10Services (watching), 10User-Eevans, 10User-fgiunchedi: New upstream jvm-tools - https://phabricator.wikimedia.org/T178839#3716017 (10Eevans) [14:09:17] (03CR) 10Daniel Kinzler: [C: 031] "Overriding Lucas' -1." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386833 (https://phabricator.wikimedia.org/T179156) (owner: 10Hoo man) [14:11:49] greg-g: PING [14:13:05] hoo: he is not around yet for sure [14:13:10] what is happening? :) [14:13:29] marostegui, when you get back, I filed https://phabricator.wikimedia.org/T179187. Please feel free to make modifications to the tagging :) [14:15:17] hashar: We would like to turn of the feature which quite likely caused yesterday's problems [14:15:27] and hopefully return to pre-issue state then (sans that feature) [14:17:00] hashar: https://gerrit.wikimedia.org/r/386833 [14:17:02] Good to go? [14:22:10] (03PS1) 10Hashar: contint1001: move add_ip6_mapped from site.pp to role [puppet] - 10https://gerrit.wikimedia.org/r/386859 [14:24:25] (03CR) 10Hashar: [C: 04-1] "Bah:" [puppet] - 10https://gerrit.wikimedia.org/r/386859 (owner: 10Hashar) [14:24:45] 10Operations, 10Release-Engineering-Team, 10Patch-For-Review: upload problem for parsoid release - https://phabricator.wikimedia.org/T179134#3716072 (10ssastry) Thanks! :) [14:25:38] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3716073 (10BBlack) Since diagnostics and SEL don't turn up anything, I'll recap what we've observed: * Initially - system was reporting kernel errors such as ` kernel:[30568.739024] NMI watchdog: BUG: soft... [14:27:05] (03CR) 10Hashar: [C: 04-1] "Would it make sense to have a profile::ipv6::mapped ? 
:]" [puppet] - 10https://gerrit.wikimedia.org/r/386859 (owner: 10Hashar) [14:28:05] hashar: no :P but it would make sense to move the automatic provisioning of mapped-v6 to somewhere common and automagic for all hosts without being asked-for. [14:28:25] hashar: but it's not simple, it's entangled with a lot of other ipv6 configuration issues we haven't solved correctly [14:29:22] I was trying to resolve a wmfstyle issue raised when having "interface::add_ip6_mapped { 'main': }" in site.pp [14:29:26] (03CR) 10Chad: [C: 031] "We don't fall back to LDAP currently either. Although perhaps we could/should." [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [14:29:35] moved it to the role, but that complains as well. So I guess it has to land in a profile [14:30:55] hashar: it's inside the profile for lvs currently (someone moved it there during refactor/cleanup?) [14:31:04] modules/profile/manifests/lvs.pp: interface::add_ip6_mapped { 'main': } [14:31:19] but literally every other case still resides in site.pp currently, and there's a lot of them [14:31:23] hashar: I think tldr is standard, interface::add_ip6_mapped, and base_firewall area anomalies at the moment without a consistent approach [14:31:31] ohhh [14:31:43] so I guess I should just ignore the wmfstyle guide warning [14:32:01] and at some point all will be magically migrated :] [14:32:05] chasemp: bblack thank you! [14:32:06] <_joe_> bblack: "someone" would be me [14:32:44] <_joe_> so the "right" place for the simple/general use would be [14:32:58] <_joe_> to add it to profile::base as a conditional declaration [14:33:00] _joe_: guess I can drop my change that move add_ip6_mapped aroudn? https://gerrit.wikimedia.org/r/#/c/386859/ [14:33:25] <_joe_> it's a chunk of functionality that should be probably be part of our base manifest? [14:33:43] well [14:34:03] in some sense, yes, everywhere we have an ipv4 IP, we really should have a mapped-v6 IP. that would restore some sanity. [14:34:16] <_joe_> even if we don't want to do that [14:34:31] <_joe_> it should be possible to tune that with a hiera variable everywhere [14:34:34] but I suspect blindly turning on "interface::add_ip6_mapped { 'main': }" for all hosts that don't currently have it might cause some real problems, and still not get us to the goal above [14:35:01] <_joe_> bblack: I said a *conditional* declaration [14:35:45] <_joe_> add a parameter to profile::base like add_main_mapped_ip6 => false [14:35:46] right, and only hardcoded to use "main"? [14:35:53] <_joe_> yes [14:36:11] <_joe_> just for the standard case where you want to do that [14:36:19] <_joe_> lvs is way more complicated IIRC [14:36:21] (03CR) 10Chad: "Nothing. Should. Reference. Cobalt. Use the service domain!" [puppet] - 10https://gerrit.wikimedia.org/r/386257 (owner: 10Paladox) [14:36:21] but it's not like that's going to eliminate all future use. e.g. we have an open ticket to go setup add_ip6_mapped on a bunch of virtual ethN.mmmm interfaces on the LVSes. [14:36:37] and LVS isn't the only case where there's more than just 1x standard ethernet port I don't think [14:37:01] (but there are very few others. I think some bonded interfaces somewhere?) 
[14:37:03] <_joe_> yeah, but that's a specific profile (the load balancer one) where we manage that case [14:37:09] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "> if turning off SPARQL causes constraint checks to use inefficient SQL queries" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386833 (https://phabricator.wikimedia.org/T179156) (owner: 10Hoo man) [14:37:18] (03PS10) 10Rush: openstack: refactor deployment specific puppetmaster code [puppet] - 10https://gerrit.wikimedia.org/r/386612 (https://phabricator.wikimedia.org/T171494) [14:37:27] (03PS4) 10Paladox: ci::slave::labs: Drop host_aliases for gerrit.wm.org ssh [puppet] - 10https://gerrit.wikimedia.org/r/386257 [14:37:34] <_joe_> wherever we do fancy things, that should be handled separately [14:37:59] (03CR) 10jerkins-bot: [V: 04-1] ci::slave::labs: Drop host_aliases for gerrit.wm.org ssh [puppet] - 10https://gerrit.wikimedia.org/r/386257 (owner: 10Paladox) [14:37:59] everything I do is fancy, that's why it's all separated from sanity so well :) [14:38:31] (03PS5) 10Paladox: ci::slave::labs: Drop host_aliases for gerrit.wm.org ssh [puppet] - 10https://gerrit.wikimedia.org/r/386257 [14:38:34] (03PS1) 10Herron: puppet: update hiera function calls in kafka_config [puppet] - 10https://gerrit.wikimedia.org/r/386866 (https://phabricator.wikimedia.org/T179077) [14:39:00] (03CR) 10jerkins-bot: [V: 04-1] puppet: update hiera function calls in kafka_config [puppet] - 10https://gerrit.wikimedia.org/r/386866 (https://phabricator.wikimedia.org/T179077) (owner: 10Herron) [14:39:08] (03Abandoned) 10Hashar: contint1001: move add_ip6_mapped from site.pp to role [puppet] - 10https://gerrit.wikimedia.org/r/386859 (owner: 10Hashar) [14:39:46] yeah but I guess +1 to the general notion [14:40:07] (03PS2) 10Herron: puppet: update hiera function calls in kafka_config [puppet] - 10https://gerrit.wikimedia.org/r/386866 (https://phabricator.wikimedia.org/T179077) [14:40:29] and I think the count of hosts with/without add_main_ip6_mapped leans heavily in the "with" direction [14:40:38] so yeah, probably a "false" flag for the few without it? [14:40:56] (are there any?) [14:41:17] hmm there are in site.pp anyways [14:43:40] (03PS8) 10Andrew Bogott: git-sync-upstream: rewrite in python [puppet] - 10https://gerrit.wikimedia.org/r/386318 (https://phabricator.wikimedia.org/T177944) [14:43:53] (03CR) 10Giuseppe Lavagetto: [C: 031] puppet: update hiera function calls in kafka_config [puppet] - 10https://gerrit.wikimedia.org/r/386866 (https://phabricator.wikimedia.org/T179077) (owner: 10Herron) [14:44:38] hmm actually there's a ton of hosts that don't have it [14:44:47] (03PS1) 10Ema: varnish: monitor child process restarts [puppet] - 10https://gerrit.wikimedia.org/r/386867 [14:46:33] (03CR) 10Andrew Bogott: [C: 032] git-sync-upstream: rewrite in python [puppet] - 10https://gerrit.wikimedia.org/r/386318 (https://phabricator.wikimedia.org/T177944) (owner: 10Andrew Bogott) [14:46:47] (03PS10) 10Andrew Bogott: git-sync-upstream: perform rebase in a separate, temporary workdir [puppet] - 10https://gerrit.wikimedia.org/r/386331 (https://phabricator.wikimedia.org/T177944) [14:46:52] (03CR) 10Daniel Kinzler: [C: 031] "Lucas: disabling all SPARQL for constraint checks is part of the *investigation*. 
It's not intended to be part of the solution, unless it " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386833 (https://phabricator.wikimedia.org/T179156) (owner: 10Hoo man) [14:47:21] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/8512/" [puppet] - 10https://gerrit.wikimedia.org/r/386866 (https://phabricator.wikimedia.org/T179077) (owner: 10Herron) [14:47:30] (03CR) 10Herron: [C: 032] puppet: update hiera function calls in kafka_config [puppet] - 10https://gerrit.wikimedia.org/r/386866 (https://phabricator.wikimedia.org/T179077) (owner: 10Herron) [14:47:33] (03CR) 10Andrew Bogott: [C: 032] git-sync-upstream: perform rebase in a separate, temporary workdir [puppet] - 10https://gerrit.wikimedia.org/r/386331 (https://phabricator.wikimedia.org/T177944) (owner: 10Andrew Bogott) [14:47:41] (03PS3) 10Herron: puppet: update hiera function calls in kafka_config [puppet] - 10https://gerrit.wikimedia.org/r/386866 (https://phabricator.wikimedia.org/T179077) [14:47:58] (03PS2) 10Andrew Bogott: git-sync-upstream: formatting and comment tweaks [puppet] - 10https://gerrit.wikimedia.org/r/386763 (owner: 10BryanDavis) [14:49:03] (03CR) 10Andrew Bogott: [C: 032] git-sync-upstream: formatting and comment tweaks [puppet] - 10https://gerrit.wikimedia.org/r/386763 (owner: 10BryanDavis) [14:49:05] !log turn on cp4024 port on asw-ulsfo [14:49:08] (03PS3) 10Andrew Bogott: git-sync-upstream: formatting and comment tweaks [puppet] - 10https://gerrit.wikimedia.org/r/386763 (owner: 10BryanDavis) [14:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:40] herron: you can merge my patch along with yours [14:50:54] andrewbogott ok will do [14:52:28] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3576064 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` cp4024.ulsfo.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201710271452_b... [14:52:37] (03PS2) 10Ema: varnish: monitor child process restarts [puppet] - 10https://gerrit.wikimedia.org/r/386867 [14:53:06] (03PS3) 10Giuseppe Lavagetto: First version of scap deployment of docker-pkg [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/386641 [14:54:33] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "I understand, and I wouldn’t -2-block this change even if I had the permission… it’s just frustrating that the option I added for this cas" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386833 (https://phabricator.wikimedia.org/T179156) (owner: 10Hoo man) [14:55:53] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler02/8514/cp1008.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/386867 (owner: 10Ema) [14:56:41] andrewbogott will that also fix it for when you have custom changes it would not pull? 
[14:57:07] paladox: it's the same as before — if there are local changes or unmergable patches it will error out [14:57:12] there's not much of a way to automate that [14:57:19] oh [14:59:38] (03PS2) 10Filippo Giunchedi: prometheus: add redis_exporter class [puppet] - 10https://gerrit.wikimedia.org/r/325466 (https://phabricator.wikimedia.org/T148637) [14:59:40] (03PS1) 10Filippo Giunchedi: hieradata: add redis stretch deployment-prep instances [puppet] - 10https://gerrit.wikimedia.org/r/386869 (https://phabricator.wikimedia.org/T148637) [14:59:42] (03PS1) 10Filippo Giunchedi: redis: add stretch support [puppet] - 10https://gerrit.wikimedia.org/r/386870 (https://phabricator.wikimedia.org/T148637) [14:59:57] (03PS1) 10Herron: puppet: update hiera function calls in kafka_cluster_name [puppet] - 10https://gerrit.wikimedia.org/r/386871 (https://phabricator.wikimedia.org/T179077) [15:00:10] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add redis_exporter class [puppet] - 10https://gerrit.wikimedia.org/r/325466 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [15:01:31] bblack: Could this at all be related to disabling flow control on the varnish NICs? [15:02:40] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/8515/" [puppet] - 10https://gerrit.wikimedia.org/r/386871 (https://phabricator.wikimedia.org/T179077) (owner: 10Herron) [15:03:17] Because we see a drop in Wikidata API requests starting about ~18:40 yesterday… [15:03:30] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3716144 (10Lucas_Werkmeister_WMDE) > (Permalink: https://grafana.wikimedia.org/dashboard/db/wikidata-quality?p... [15:03:30] which is a few minutes before our deploy [15:03:38] but immediately after your change [15:04:03] no, I really don't think that's it, but I'll admit it's complicated to justify that. [15:04:25] (03CR) 10Herron: [C: 032] puppet: update hiera function calls in kafka_cluster_name [puppet] - 10https://gerrit.wikimedia.org/r/386871 (https://phabricator.wikimedia.org/T179077) (owner: 10Herron) [15:05:04] the flow control changes themselves certainly shouldn't be capable of being involved in this. more likely the sequence of depooling cache nodes to accomplish the change perturbed patterns, especially for long-running requests. [15:05:05] Ok :) Will (hopefully) continue as planned then [15:05:09] whenever greg-g oks [15:05:35] but also, I'm not really sure about the precision of the timeline right around that mark, either [15:06:14] e.g. it's pretty clear that the initial ORES->wikidata timeouts causing MW fatals are related to the sorting changes, but the 503s related to that on the graphs do happen slightly before the IRC notifications of deployment completion. [15:06:41] andrewbogott i wonder would it be able to support automatically merging changes? [15:06:51] it makes me think the actual effect of that deployment (or the deployment process itself) happened prior to the IRC notification by some small lag [15:06:58] ie when git pulling, it asks me to press x (as it uses nano) to merge the change [15:07:04] could the script do that? [15:07:05] It does automatically merge if the patches can be merged [15:07:17] oh, it dosen't work for me [15:07:32] (it's basically a "deploy complete" stamp, but not a "deploy process started" stamp?) 
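[Editor's sketch] For context on the Puppet4 "hiera() can only be called using the 4.x function API" tasks herron is filing and patching above: the usual offenders are ERB templates and 3.x-style parser functions calling hiera() directly. One common remedy, sketched below with invented module, key and file names, is to do the lookup in the manifest and hand the value to the template; the other is switching the caller to the 4.x API (Scope#call_function), which is what the error message itself points at.

```puppet
# Illustrative only: module, key and file names are made up to show the
# shape of the fix, not the actual WMF change.
class profile::example (
    String $acl_base_dn = hiera('profile::example::acl_base_dn'),
) {
    file { '/etc/example/acls.conf':
        ensure  => present,
        owner   => 'root',
        group   => 'root',
        mode    => '0444',
        # The template now reads @acl_base_dn instead of calling hiera()
        # itself, which is what trips the 4.x function API error.
        content => template('example/acls.conf.erb'),
    }
}
```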
[15:07:34] it merged fine for me when doing it manually, but rebasing through the script failed [15:07:38] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3716169 (10herron) [15:07:40] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/analytics_cluster/refinery/job/camu... - https://phabricator.wikimedia.org/T179077#3716166 [15:08:01] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:08:11] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:08:11] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:08:21] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:08:33] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:08:33] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:08:42] who would have thought :P [15:09:00] :( [15:09:07] bearloga is not available on irc [15:09:23] I'll wait a bit more before kill his R processes [15:11:05] hoo: yeah I donno, all of these timelines are tricky. We could argue for better isolation of various changes, but most of the time most days there's too much going on. We'd all be twiddling our thumbs waiting in the "make changes" queue if we eliminated all overlapping changes at all layers, and it's not that easy to predict which will fail or how long they'll take to fail, either. [15:12:46] (03PS4) 10Giuseppe Lavagetto: First version of scap deployment of docker-pkg [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/386641 [15:12:48] (03PS1) 10Giuseppe Lavagetto: Artifacts for the first release [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/386878 [15:13:26] looking back at: https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5&from=1509042304368&to=1509055422662 [15:14:23] it might be a valid interpretation that the sequenced depools for the ethtool changes caused the first 503 spike (18:40-45), and then the 503s from the wikidata-related stuff didn't start until 18:51 (the second, more-sustained wave) [15:15:15] (which then lines up both of the 503 start-points with known changes in the timeline) [15:16:53] (03PS3) 10Ema: varnish: monitor child process restarts [puppet] - 10https://gerrit.wikimedia.org/r/386867 [15:17:21] bblack: Yeah, that could as well be… would also match up with our data from the hadoop web requests [15:17:43] anyway… everything pointing to the constraints… so we should just turn the relevant stuff off, I guess [15:18:35] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3716222 (10BBlack) So, the reinstall attempt failed with another crash during the installer. I think we have to be looking at bad hardware here: ``` ┌───────────────────────────┤ Configuring apt ├─────... [15:18:51] (03CR) 10Daniel Kinzler: [C: 031] "> the plan is to deploy this config change, move Wikidata Build back to wmf.5, and then hopefully the breakage won’t repeat?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/386833 (https://phabricator.wikimedia.org/T179156) (owner: 10Hoo man) [15:19:03] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/kafka/analytics/burrow.pp:7:15 - https://phabricator.wikimedia.org/T179168#3716225 (10herron) [15:19:05] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/analytics_cluster/refinery/job/camu... - https://phabricator.wikimedia.org/T179077#3716227 [15:21:19] (03CR) 10BBlack: "Beyond the context discussed right here though, there's some other unrelated emergency fixups to revert in T179156 . If we're pretty sure" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386833 (https://phabricator.wikimedia.org/T179156) (owner: 10Hoo man) [15:21:29] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/profile/manifests/kafka/broker.pp:92:23 - https://phabricator.wikimedia.org/T179167#3716229 (10herron) [15:21:31] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/analytics_cluster/refinery/job/camu... - https://phabricator.wikimedia.org/T179077#3716231 [15:22:15] bblack: Agreed… so go back to pre-outage, then disable the feature in question? [15:23:24] 10Operations, 10Puppet, 10User-Joe: puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/kafka/main/broker.pp:14:23 - https://phabricator.wikimedia.org/T179166#3716234 (10herron) [15:23:26] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/analytics_cluster/refinery/job/camu... - https://phabricator.wikimedia.org/T179077#3712540 [15:24:24] I would unwind the varnish hack, then unwind anything anything we did that wasn't part of wikidata wmf.5 (e.g. ores timeout change, the remex change, etc?), so we're back to pre-outage state other than the one key bit that wikidatawiki is still on wmf.4 in general. And then disable the feature in question and try moving to wmf.5. [15:24:39] +1 [15:24:40] but I don't know what kind of timeline I'd apply to all of that for sure, and maybe some of it should wait for monday. [15:25:04] I'd like to move everything *but* wikidata to wmf.5 first [15:25:06] We halted the train [15:26:25] 10Operations, 10DBA, 10Patch-For-Review: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3716241 (10Marostegui) And commonswiki.templatelinks also optimized on dbstore1001 [15:26:50] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. 
See Scope#call_function at /etc/puppet/modules/role/manifests/logging/kafkatee/webrequest/base.pp:11:21 - https://phabricator.wikimedia.org/T179172#3716244 (10herron) [15:26:52] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/analytics_cluster/refinery/job/camu... - https://phabricator.wikimedia.org/T179077#3716246 [15:27:04] hashar: just saw your task, not sure DBA is the right tag for that, I will remain subscribed but remove DBA as i don't think there is anything for us to do there:-) [15:27:23] (03PS2) 10Chad: All but wikidatawiki to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386642 [15:29:20] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/profile/manifests/pmacct.pp:19:26 - https://phabricator.wikimedia.org/T179175#3716255 (10herron) [15:29:22] bblack: FWIW, the ORES timeout is already rolled back, I was able to immediately see that it didn’t help anything (probably would have made things worse, actually). The one outstanding ORES change is that we disabled automatic scoring of Wikidata recent changes. Do you agree we should leave that as is until the API performance recovers? [15:29:22] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/analytics_cluster/refinery/job/camu... - https://phabricator.wikimedia.org/T179077#3716257 [15:30:18] no_justification: so if I understand the train correctly, wikidatawiki was already on wmf.5 since weds as it's in group1. thurs (yesterday) it was intended to go to wmf.5 for group2 (all the rest, the wikipedias), but we blocked that because this happened first [15:32:19] awight: the API performance already recovered, I think. So I think in my "unwind" plan above (which is just an uninformed strawman, I shouldn't be the one making these calls I don't think!), I think we'd still revert the disabled scoring of wikidata, before doing the wikidata-level feature-disable -> move back to wmf.5 [15:32:23] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:32:53] PROBLEM - IPMI Sensor Status on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:33:04] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [15:33:23] RECOVERY - Disk space on stat1005 is OK: DISK OK [15:33:23] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [15:33:33] RECOVERY - DPKG on stat1005 is OK: All packages OK [15:33:43] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [15:33:43] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [15:33:53] bblack: OK that makes sense to me. I won’t reenable today because it’s Friday, but can do early on Monday. [15:34:45] (03PS1) 10Herron: puppet: update hiera function calls in ganglia_aggregator_config [puppet] - 10https://gerrit.wikimedia.org/r/386881 (https://phabricator.wikimedia.org/T179180) [15:36:01] gehel: are you around? 
looks like lag problems are back :( [15:36:23] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:36:26] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Puppet4: update hiera function calls in kafka_config and kafka_cluster_name - https://phabricator.wikimedia.org/T179077#3716281 (10herron) [15:37:04] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1800.0] [15:37:33] PROBLEM - High lag on wdqs1004 is CRITICAL: CRITICAL: 34.48% of data above the critical threshold [1800.0] [15:37:45] nobody's done anything yet (recently) with stuff from last night, right? [15:37:52] Nope [15:38:00] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3716283 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp4024.ulsfo.wmnet'] ``` Of which those **FAILED**: ``` ['cp4024.ulsfo.wmnet'] ``` [15:39:15] the wdqs lag makes me worry a bit, all context considered [15:39:21] but I don't see a rash of 503s yet either [15:39:31] bblack: Only the pooled wdqs instance are troubling [15:39:46] so I guess the API is for one not the cause [15:40:03] but might still be related to the constraints stuff (it does SPARQL queries afterall) [15:40:25] elevated socket counts are back, on cp1054 this time (the still-in-place 10K limit saving us from 503 fallout).... [15:40:33] bblack: All correct re: train yes [15:40:39] Wikidatawiki got rolled back to wmf.4 though [15:40:58] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/8517/" [puppet] - 10https://gerrit.wikimedia.org/r/386881 (https://phabricator.wikimedia.org/T179180) (owner: 10Herron) [15:41:04] PROBLEM - High lag on wdqs1005 is CRITICAL: CRITICAL: 36.67% of data above the critical threshold [1800.0] [15:41:18] (03CR) 10Herron: [C: 032] puppet: update hiera function calls in ganglia_aggregator_config [puppet] - 10https://gerrit.wikimedia.org/r/386881 (https://phabricator.wikimedia.org/T179180) (owner: 10Herron) [15:41:33] Why does this only affect eqiad wdqs instance? [15:41:42] gehel: Do they fetch entities differently? [15:42:14] Ah, I see… this is probably the constraints having trouble with wdqs [15:42:28] which will (or would if we don't have the high limit) eventually lead to a 503 spike [15:42:35] hoo: the hypothesis from SMalyshev is that we benefit more from cache din codfw, but that's only a hypothesis [15:42:39] Can we please turn this off now? [15:42:39] 10Operations, 10Puppet, 10User-Joe: Puppet4: hiera() can only be called using the 4.x function API. - https://phabricator.wikimedia.org/T179181#3716292 (10herron) [15:42:39] (03CR) 10Filippo Giunchedi: [C: 031] varnish: monitor child process restarts [puppet] - 10https://gerrit.wikimedia.org/r/386867 (owner: 10Ema) [15:42:41] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3716293 (10herron) [15:42:46] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/ganglia/config.pp:4:21 - https://phabricator.wikimedia.org/T179180#3716290 (10herron)... [15:43:07] yeah so in varnish stats, I see this problem starting back up today, at circa 14:20-30 or so. things were quiet before then. 
[15:43:19] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/ganglia/config.pp:4:21 - https://phabricator.wikimedia.org/T179180#3715806 (10herron) [15:43:20] but we're definitely seeing the same symptom as last night under the 10K limit [15:43:23] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:43:23] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:43:23] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:43:33] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:43:43] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:43:44] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:43:56] sigh [15:43:58] gehel, hoo: another hypothesis may be that codfw updater talks to codfw servers for wikidata maybe? [15:44:00] I don't think anything here or in SAL actually happened around that time? [15:44:04] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:44:34] SMalyshev: Mh, codfw is not pooled… and wdqs is (intentionally) doing cache misses [15:44:45] was there a triggering event on the wdqs side? re-enabling something or other a little over an hour ago? [15:44:54] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6461/IPv4: Active [15:44:54] bblack: No… but the feature in question was never turned off [15:44:59] we just re-broke it some more [15:45:03] so that it has less effect [15:45:08] hmmm ok [15:45:33] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=16&fullscreen&orgId=1&var-server=cp1054&var-datasource=eqiad%20prometheus%2Fops&from=1509095608849&to=1509118895538 [15:45:46] ^ this is the restart of the bad/excessive connections on cp1054 symptom [15:45:54] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 29, down: 0, shutdown: 0 [15:46:02] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3716304 (10Halfak) [15:46:03] PROBLEM - Check Varnish expiry mailbox lag on cp4022 is CRITICAL: CRITICAL: expiry mailbox lag is 2064313 [15:46:04] 10Operations, 10ORES, 10Scoring-platform-team: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3716303 (10Halfak) [15:46:18] without the 10K hack, that would've already been in unrelated-503's territory [15:46:43] So… can we please turn the feature off now… if we want, we can turn it back on afterwards… but let's give it a shot [15:46:45] shall we? [15:46:46] !log restart varnish-backend on cp4022 (upload@ulsfo) - mailbox [15:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:07] hoo: if you think this will fix it, yes please. 
[15:47:28] It's the best bet we have [15:47:34] and it wont harm here [15:47:54] 10Operations, 10Puppet, 10User-Joe: Puppet4: Update hiera function calls in ganglia_aggregator_config - https://phabricator.wikimedia.org/T179180#3716310 (10herron) [15:48:07] !log Compress InnoDB on db2038 (s6) - T178359 [15:48:07] (03CR) 10Hoo man: [C: 032] "Let's try this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386833 (https://phabricator.wikimedia.org/T179156) (owner: 10Hoo man) [15:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:13] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [15:48:54] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6461/IPv6: Connect, AS6461/IPv4: Active [15:49:15] XioNoX: ---^ are those snmp failures? [15:49:18] (03Merged) 10jenkins-bot: Disable constraints check with SPARQL for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386833 (https://phabricator.wikimedia.org/T179156) (owner: 10Hoo man) [15:49:22] 15:48:59.683 [update 9] DEBUG o.w.q.r.t.w.WikibaseRepository - Done in 156294 ms [15:49:22] 15:49:00.153 [update 3] DEBUG o.w.q.r.t.w.WikibaseRepository - Done in 156456 ms [15:49:25] wow [15:49:27] (03CR) 10jenkins-bot: Disable constraints check with SPARQL for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386833 (https://phabricator.wikimedia.org/T179156) (owner: 10Hoo man) [15:49:30] elukey: CHECKING [15:49:33] er, no cps [15:49:38] ahhahaha :) [15:49:39] thanks [15:49:43] 156s requets... that's way worse than before [15:49:54] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 29, down: 0, shutdown: 0 [15:49:55] before it was like 10s. Now it's 156s [15:50:07] hoo: ^^ [15:50:16] 10Operations, 10Puppet, 10User-Joe: puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/eventlogging/analytics/server.pp:27:29 - https://phabricator.wikimedia.org/T179162#3716319 (10herron) [15:50:19] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Puppet4: update hiera function calls in kafka_config and kafka_cluster_name - https://phabricator.wikimedia.org/T179077#3716321 (10herron) [15:50:33] and now we're getting real 503s I think, btw [15:50:48] something is very seriously broken here, fetching single RDF page should not take 156s [15:50:57] 15:50:30.259 [update 8] DEBUG o.w.q.r.t.w.WikibaseRepository - Done in 246651 ms [15:50:58] !log hoo@tin Synchronized wmf-config/Wikibase-production.php: Disable constraints check with SPARQL for now (T179156) (duration: 00m 50s) [15:50:58] sorry, not 503s [15:51:01] so v4 to Zayo is up, but v6 is still down [15:51:03] even worse :( [15:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:04] T179156: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156 [15:51:04] it's a small rate of 500s [15:51:20] 10Operations, 10Puppet, 10User-Joe: puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. 
See Scope#call_function at /etc/puppet/modules/role/manifests/webperf.pp:17:22 - https://phabricator.wikimedia.org/T179164#3716327 (10herron) [15:51:22] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Puppet4: update hiera function calls in kafka_config and kafka_cluster_name - https://phabricator.wikimedia.org/T179077#3712540 (10herron) [15:51:24] wtf, and now it's the opposite, v4 is down and v6 is up [15:51:32] and now both are up [15:51:47] I'll monitor, don't hesitate to ping me [15:51:53] hoo, gehel: maybe stop the updaters and see whether anything changes? [15:51:56] 500s should have logstash entries, I bet [15:52:00] if it's flaps too much we might disable the sessions for a bit of time [15:52:52] [{exception_id}] {exception_url} Wikibase\DataModel\Services\Statement\StatementGuidParsingException from line 55 of /srv/mediawiki/php-1.31.0-wmf.4/extensions/Wikidata/vendor/wikibase/data-model-services/src/Statement/StatementGuidParser.php: $serializ [15:52:56] but there's only a few of those [15:53:32] 10Operations, 10Puppet, 10User-Joe: Puppet4: Error while evaluating a Function Call, hiera() can only be called using the 4.x function API. See Scope#call_function at /etc/puppet/modules/role/manifests/graphite/alerts.pp:12:21 - https://phabricator.wikimedia.org/T179163#3716335 (10herron) [15:53:34] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Puppet4: update hiera function calls in kafka_config and kafka_cluster_name - https://phabricator.wikimedia.org/T179077#3716337 (10herron) [15:53:44] Are we improving? [15:54:07] hard to say yet on my end [15:55:24] is the deploy of the disable still ongoing or done? [15:55:48] I guess 15:50 SAL entry above is the "done" mark [15:56:03] RECOVERY - Check Varnish expiry mailbox lag on cp4022 is OK: OK: expiry mailbox lag is 0 [15:56:15] ok, I'll disable wdqs updater, let's see if that changes anything [15:56:34] nothing has changed much in varnish symptoms yet, but they could lag underlying fixes, too [15:56:55] Load on wdqs dropped already [15:57:00] at the point we disabled the feature [15:57:03] ok [15:57:03] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6461/IPv6: Active [15:58:03] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 29, down: 0, shutdown: 0 [15:58:04] !log disabling wdqs updater on all nodes [15:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:12] XioNoX: is that zayo email that just arrived what is going on with the BGP? [15:58:24] gehel: Did you already? [15:58:27] Or are you going to? [15:58:30] hoo: I don't think wdqs load is a big issue now. The issue is that wikidata takes 200-300s to render single rdf export instead of usual 70-80ms [15:58:44] nope, doing it right now [15:59:01] completed [15:59:13] we can re-enable in a few and see if that had an impact... 
[15:59:32] marostegui: nah, that email is about a circuit between ulsfo and codfw [15:59:43] ah ok :) [16:00:14] so, some of the varnish stats show signs of the beginning of recovery, but I think once this query pattern has hit it long enough, it takes a long time to recover from the bad pattern's effects [16:00:27] so I'm kind of torn between waiting it out or restarting it [16:00:54] "restart it" was the solution last night, but then that leaves a bunch of ambiguity about triggers and causes heh [16:03:36] [waiting it out a little bit, for now] [16:03:44] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [16:03:44] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [16:04:13] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [16:04:24] RECOVERY - Disk space on stat1005 is OK: DISK OK [16:04:24] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [16:04:34] RECOVERY - DPKG on stat1005 is OK: All packages OK [16:05:03] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6461/IPv6: Active [16:05:51] Number of connections from wdqs hosts to test-lb are dropping [16:06:29] someone is going to have to draw a picture sometime soon, of which services call which services, etc [16:07:30] (really, I don't know that it's necessarily sane that we have internal services making inter-service calls through the cache frontends, but I'm sure there were reasons (like to take advantage of caching)) [16:07:33] bblack: https://natashaskitchen.com/wp-content/uploads/2015/01/Spaghetti-and-Meatballs-Recipe-3-600x400.jpg [16:07:43] XioNoX: some BGP session has issue apparently: BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6461/IPv6: Active [16:08:18] XioNoX: I guess that is what you were looking at earlier [16:08:23] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:08:25] ideally my worldview is "public" original requests from browsers hit the cache frontends, they forward to , and then a bunch of inter-service calls may happen but they all happen through foo.svc.eqiad.wmnet, not back through the outer cache layers again [16:08:30] bblack: Funilly enough we are bypassing caches here intentionally [16:08:46] "bypassing" how? [16:08:58] hashar: yeah [16:08:59] funky URL parameters AFAIR [16:09:10] XioNoX: icinga complained about it like 4 minutes ago [16:09:12] bblack: Part of the reason for that is because we don't have good mechanisms for inter-wiki data. [16:09:16] (cf: Wikidata!) [16:09:33] yeah that's crazy though [16:09:37] So if you build a cross-wiki tool, it often involves going off and finding that data from the API or something [16:09:39] This is not interwiki, but wiki -> something completely different [16:09:46] That's what I mean [16:09:55] Hence inter-wiki not interwiki :) [16:09:58] In the Wikis we don't need the APIs at all [16:09:59] why go back through the varnish public entrypoints with a cache-smashing URL parameter, when you could just query https://appservers.svc.eqiad.wmnet/ (or whatever) directly? [16:10:06] !log deactivating BGP sessions to Zayo in eqiad (flapping) [16:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:15] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Check analytics1037 power supply status - https://phabricator.wikimedia.org/T179192#3716364 (10elukey) [16:10:19] bblack: Honestly? 
Not thinking it through? [16:11:03] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 27, down: 0, shutdown: 0 [16:11:39] Also, it's "easier" to just make a request to foo.wp.o rather than send to appservers.svc... with a Host: header [16:11:48] The former is kind of obvious, the latter is "Oh yeah I should do that" [16:11:52] * no_justification shrugs [16:12:17] this all makes me wonder if once the bad behavior is triggered (timeouts start happening), it has some looped behavior going on [16:12:19] (03PS1) 10Herron: puppet: update hiera function call in role::puppetmaster::puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/386886 (https://phabricator.wikimedia.org/T179171) [16:12:32] where requests go foo->cache->bar->cache->foo..... until they timeout somewhere [16:12:48] I disabled the sessions and sent an email to Zayo [16:13:18] bblack: Indeed. The extra layer of varnish makes it confusing to follow (cf: yesterday) [16:13:21] anyways, done waiting for varnish recovery, I'm going to restart it [16:13:41] which will either (a) make all problems magically evaporate or (b) move the problem to a different cache node [16:13:49] I wonder if we could get some logging on "internal things that are making HTTP requests to other wikis" to see what needs fixing [16:14:14] it would be pretty easy to capture varnishlog samples of requests to public entrypoints from internal/private IPs [16:14:15] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Check analytics1037 power supply status - https://phabricator.wikimedia.org/T179192#3716380 (10elukey) [16:14:23] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:14:24] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:14:24] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:14:33] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:14:43] bblack: That would be *very* useful [16:14:44] PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:14:45] (and then filter out the ones that are intentional from monitoring systems) [16:14:53] ^ oops, I thought I had that silenced... [16:14:56] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:15:18] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/8518/" [puppet] - 10https://gerrit.wikimedia.org/r/386886 (https://phabricator.wikimedia.org/T179171) (owner: 10Herron) [16:15:23] (03CR) 10Herron: [C: 032] puppet: update hiera function call in role::puppetmaster::puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/386886 (https://phabricator.wikimedia.org/T179171) (owner: 10Herron) [16:15:27] damn, I just forgot to press enter... 
[16:16:20] !log wdqs updater is now stopped for real [16:16:23] !log cp1054 varnish backend restarted (was 503s / bad-conns target of ongoing issues) [16:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:42] so now I'm back to watching the connection counts across all of them to see which (if any) picks up the problem anew [16:17:07] 10Operations, 10Puppet, 10User-Joe: Puppet4: update hiera function call in role::puppetmaster::puppetdb - https://phabricator.wikimedia.org/T179171#3716384 (10herron) [16:17:09] bblack: Do we now know that the disabling helped or are we still more or less clueless? [16:17:12] (so far, nothing immediately obvious, but it could take a few minutes to build up and become noticeable) [16:17:13] 10Operations, 10Puppet, 10User-Joe: Puppet4: hiera() can only be called using the 4.x function API. - https://phabricator.wikimedia.org/T179181#3716388 (10herron) [16:17:17] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3716389 (10herron) [16:17:19] 10Operations, 10Puppet, 10User-Joe: Puppet4: update hiera function call in role::puppetmaster::puppetdb - https://phabricator.wikimedia.org/T179171#3715674 (10herron) 05Open>03Resolved a:03herron [16:17:59] hoo: I think we're more-or-less clueless. if the problem recurs on a fresh varnish, it definitely didn't help [waiting for results on that] [16:18:24] So the problem is expected to move to a different varnish now? [16:18:37] if disabling didn't help, yes [16:18:43] Cool [16:19:01] but technically, if we're being hardcore about what's provable and what isn't [16:19:11] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=16&fullscreen&orgId=1&from=now-4h&to=now&var-server=cp1053&var-datasource=eqiad%20prometheus%2Fops [16:19:22] bblack: Did you also touch that? [16:19:27] if it doesn't move to a different varnish and the problem just evaporates after my restart, we don't know if the disable fixed it and varnish needs a kick to recover, or if varnish is actually the problem. [16:19:55] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [16:19:55] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [16:19:59] hoo: no, but that's different, those are frontend connections [16:20:13] PROBLEM - puppet last run on wdqs2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[wdqs-updater] [16:20:35] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=16&fullscreen&orgId=1&from=now-4h&to=now&var-server=cp1065&var-datasource=eqiad%20prometheus%2Fops [16:20:40] ^ same on cp1065, etc [16:20:56] that could be the wdqs updater [16:21:05] those are the same stuck connections/queries as the cp1054 backend (that I restarted), at the stage where they're moving through all of the nodes' frontends [16:21:21] aha [16:22:06] so far, the problem has remained dormant now [16:22:45] so because , it could be the case that once we trigger the problematic load through a varnishd backend, it just has a hard time recovering quickly on its own even when the underlying problem goes away. 
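[Editor's sketch] A plausible reading of the wdqs alerts above (hedged, since the actual wdqs manifests are not shown here): puppet manages the updater as a service resource it keeps running, so stopping the unit by hand can leave both the "Check systemd state" and "puppet last run" checks unhappy until the unit is healthy again or the resource is toggled off.

```puppet
# Sketch of the kind of resource likely involved (name and structure assumed):
service { 'wdqs-updater':
    ensure => running,
    enable => true,
}
# If the agent's attempt to bring the unit back up fails (or the unit is held
# down), each run reports a failed Service[wdqs-updater] resource, matching
# the "puppet last run ... Failed resources" alerts seen above.
```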
[16:23:03] PROBLEM - puppet last run on wdqs1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[wdqs-updater] [16:23:15] it could also be that once the badness is triggered, it has self-looping behaviors through the cache layer as mentioned above, and the restart breaks the cycle. [16:23:18] But we can't know whether the wdqs updater or the constraint were the issue [16:23:33] or it could be that everything wikidata-related is a red herring and this is a pure varnish problem (but I think that's unlikely) [16:25:21] Can we re-enable the wdqs updater to rule that out? [16:25:40] that's my thought too, but lets let things rest 15mins or so, just to confirm the problem is really currently dead [16:25:45] so that the only change is the disabling of the constraints doing sparql queries [16:25:49] ok [16:27:20] hoo: "disabling of the constraints doing sparql queries": what do you mean? [16:27:36] gehel: The WikibaseQualityConstraints extension was doing SPARQL queries [16:27:44] that affected a specific API only [16:27:49] Oh, ok. [16:27:50] but it was doing that [16:29:53] * hoo would like sudo access to www-data on the wdqs* hosts, so that I can look at open connections there [16:29:54] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:29:54] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:30:03] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.172 second response time [16:32:23] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Fri 2017-10-27 16:32:19 UTC. [16:32:28] bblack: the wdqs updater connects to query.wikidata.org, all that is in fine routed to eqiad, right? [16:32:53] RECOVERY - IPMI Sensor Status on stat1005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [16:32:59] bblack: I'm trying to understand why we see higher response times from eqiad, but not from codfw [16:33:05] gehel: updater does not use query. [16:33:24] gehel: all communication to blazegraph from updates comes via local connection [16:33:25] my bad, wikidata.org [16:33:35] gehel: ah, yes [16:33:39] gehel: also query.wikidata.org splits traffic to codfw+eqiad technically (based on where it entered. so clients in codfw will use codfw) [16:34:01] bblack: my bad mixing up. 
I meant wikidata.org [16:34:03] interesting, looking at some short samples of internal traffic to public text-lb IP [16:34:05] so I suspect yes, codfw not affected because codfw nay be talking to codfw wikidata [16:34:50] there's a fair number of requests from graphoid UA to public text-lb [16:34:51] (03PS1) 10Herron: puppet: update hiera function call in horizon local_settings template [puppet] - 10https://gerrit.wikimedia.org/r/386887 (https://phabricator.wikimedia.org/T179086) [16:35:22] there are also requests from restbase1nnn.eqiad.wmnet -> text-lb, querying /api/rest_v1/ URLs, which seems quite crazy [16:35:36] (since those are routed right back to RB) [16:36:08] and some service on the scb cluster that doesn't set a user agent requested /wiki/Special:FeedItem/featured/20171027000000/de?useskin=apioutput from the public text-lb entrypoint [16:36:21] I'm sure there's a ton more if you look long and broad, this was a quick sample on one cache host [16:37:00] SMalyshev: wikidata.org resolves to text-lb.codfw.wikimedia.org (in codfw), but as far as I understand, varnish will route that traffic back to eqiad. After a layer of caching [16:37:18] yes [16:37:39] hmm. that kills this theory then. unless layers of caching are to blame... [16:37:42] (because MW (which services wikidata.org queries) isn't active/active) [16:37:57] and I don't expect that layer of cache to improve response times to the extend that we see [16:38:16] gehel: Did you ever look at open connections to the nginx? [16:38:18] so it's about about 15 minutes now, still no signs of trouble [16:38:21] I couldn't :/ [16:38:30] let's turn on wdqs updater? [16:38:35] +1 [16:38:37] sure... [16:39:00] !log re-enabling wdqs-updater [16:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:06] (03CR) 10Volans: [C: 031] "Much nicer structure, I think we can start with this and battle-test it. Few comments inline. I will need to look more on how scap does th" (032 comments) [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/386641 (owner: 10Giuseppe Lavagetto) [16:39:24] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational [16:39:25] RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational [16:39:33] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational [16:39:41] updater is now running again... 
[16:39:43] RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational [16:39:54] RECOVERY - Check systemd state on wdqs2002 is OK: OK - running: The system is fully operational [16:39:54] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational [16:40:30] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/8519/" [puppet] - 10https://gerrit.wikimedia.org/r/386887 (https://phabricator.wikimedia.org/T179086) (owner: 10Herron) [16:40:34] (03Draft1) 10Paladox: javascript: Only install npm package on jessie [puppet] - 10https://gerrit.wikimedia.org/r/386889 [16:40:36] (03PS2) 10Paladox: javascript: Only install npm package on jessie [puppet] - 10https://gerrit.wikimedia.org/r/386889 [16:40:49] (03PS3) 10Paladox: javascript: Only install npm package on jessie [puppet] - 10https://gerrit.wikimedia.org/r/386889 [16:40:53] (03CR) 10Herron: [C: 032] puppet: update hiera function call in horizon local_settings template [puppet] - 10https://gerrit.wikimedia.org/r/386887 (https://phabricator.wikimedia.org/T179086) (owner: 10Herron) [16:40:54] Fetches seem fast again [16:40:57] wdqs lag seems to already be going down, on both eqiad and codfw [16:41:00] now stuck connections on the wdqss [16:41:18] s/now/no/? [16:41:22] hoo: stuck connections? how? [16:41:37] gehel: Well, last time I saw long open tcp connections to text-lb [16:41:40] on eqiad only [16:41:49] from the wdqs updater java process [16:42:14] I think he meant "no stuck connections on the wdqss" [16:42:25] hoo: but not anymore? which is consistent with the updater catching up and wikidata responding reasonably fast [16:42:36] understood [16:43:12] no signs of trouble yet on my end [16:43:31] bblack: any hypothesis on why we did not have updater issues from codfw when we had them from all eqiad servers? [16:44:08] I could imagine a number of reasons, but I don't think I understand the sprawling effects of this problem well enough to settle on one :) [16:44:29] (03PS4) 10Paladox: javascript: Remove the npm package [puppet] - 10https://gerrit.wikimedia.org/r/386889 [16:44:30] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Puppet4: update hiera function call in openstack2 horizon local_settings.py template - https://phabricator.wikimedia.org/T179086#3716430 (10herron) [16:44:41] 10Operations, 10Puppet, 10User-Joe: Puppet4: hiera() can only be called using the 4.x function API. - https://phabricator.wikimedia.org/T179181#3716433 (10herron) [16:44:43] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Puppet4: update hiera function call in openstack2 horizon local_settings.py template - https://phabricator.wikimedia.org/T179086#3712883 (10herron) 05Open>03Resolved a:03herron [16:44:44] but it's entirely possible that requests funneled through codfw caches avoided some of the bad effects happening in the eqiad ones. 
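The check hoo describes (whether the updater's java process is holding long-lived TCP connections to text-lb) can also be done without strace. A rough sketch, assuming psutil is installed on the host and the script runs as the process owner or root; the process-matching heuristic is purely illustrative:

```python
#!/usr/bin/env python
"""List established TCP connections held by the wdqs-updater java process.
Assumes psutil is available; run as the process owner or root."""
import psutil

def find_updater():
    # Look for a java process whose command line mentions the updater.
    for proc in psutil.process_iter(attrs=["pid", "name", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if proc.info["name"] == "java" and "updater" in cmdline:
            return proc
    return None

proc = find_updater()
if proc is None:
    print("no updater-looking java process found")
else:
    print("pid {}: {}".format(proc.pid, proc.name()))
    for conn in proc.connections(kind="inet"):
        # Only established connections with a remote end are interesting here.
        if conn.status == psutil.CONN_ESTABLISHED and conn.raddr:
            print("  {}:{} -> {}:{}".format(conn.laddr[0], conn.laddr[1],
                                            conn.raddr[0], conn.raddr[1]))
```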
[16:44:45] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3716434 (10herron) [16:45:55] (03PS5) 10Paladox: javascript: Remove the npm package [puppet] - 10https://gerrit.wikimedia.org/r/386889 [16:46:01] I still don't have a much better picture of the total situation than: -> [chain reaction through house of cards] -> lots of other things fail and/or have issues [16:46:24] I still don't have a working theory either … [16:47:07] hm… query.wikidata.org is also behind varnish (but misc, I assume?) [16:47:54] So regular API request (to wd.o) -> sparql query to query.wd.o (through misc varnish) [16:48:03] RECOVERY - puppet last run on wdqs1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:48:21] and the wdqs makes a query back to wikidata.org through text again [16:48:31] yeah there could easily be chains that loop here [16:48:46] nope, there is no loop [16:49:21] wdqs does make requests to wikidata.org right? [16:49:24] the only communication (as far as I know) from wdqs to wikidata is the updater, which is entirely async and separated from the user traffic [16:49:55] (03PS6) 10Paladox: javascript: Remove the npm package [puppet] - 10https://gerrit.wikimedia.org/r/386889 [16:49:58] bblack: yes [16:50:06] I mean updater [16:50:13] RECOVERY - puppet last run on wdqs2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:50:14] blazegraph does not [16:50:15] But that's another process [16:50:23] hmm ok [16:50:23] not even seeing the incoming queries [16:50:24] so yes it does, but not in a way that should allow the creation of loops [16:51:13] PROBLEM - puppetmaster https on puppetmaster2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [16:51:32] herron: is this you? ^^^ [16:51:48] volans oops yes downtime expired [16:52:02] ok [16:52:25] (03PS7) 10Paladox: javascript: Remove the npm package [puppet] - 10https://gerrit.wikimedia.org/r/386889 [16:53:44] still no signs of issue through varnish [16:55:11] magic resolution... not the best :( [16:55:13] 10Operations, 10ORES, 10Scoring-platform-team: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3478146 (10awight) @akosiaris We ran into a little glitch, that wheels created on ores-misc-01 aren't compatible with the new ores* cluster. I'm thinking we should reorder the d... [16:57:29] We could turn on our stuff again and stress it :P [16:57:31] * hoo hides [17:01:13] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.285 second response time [17:01:30] (03PS1) 10BBlack: First stab documenting HFP/HFM cases [puppet] - 10https://gerrit.wikimedia.org/r/386895 [17:02:17] it would be an interesting confirmation at least? if we turned the feature back on and saw things start going crazy shortly after [17:02:35] but... the feature was on all last night too, what made it wait to start causing fallout until a few hours ago? 
[17:03:22] maybe just the daily load ramp-in [17:03:34] in which case testing now under still-loaded conditions might repro [17:04:24] but maybe it can all wait for monday, too [17:04:37] I'm going to go ahead with wmf.5 on the rest of the non-WD wikis [17:04:42] Er, grab a smoke then do it [17:05:21] because right now either (a) We've identified the root problem and it's shut off (yay, easy weekend!), or (b) we have no freaking idea what's really wrong, but at least if it recurs again this weekend we know the feature disable didn't kill it [17:06:37] no_justification: meaning the train is considered unblocked now? [17:07:12] I think we're ok to unblock it [17:07:19] If you think so too [17:08:31] I wouldn't, if it were on me [17:09:02] there's still a real possibility that if everything stays change-frozen, this problem isn't nailed down and recurs later today or over the weekend [17:09:22] at which point someone somewhere will be screaming "why did we pile more change throughput on top of this?" [17:10:00] This is true. [17:10:10] 10Operations, 10Ops-Access-Requests, 10DBA, 10cloud-services-team (Kanban): Access to raw database tables on labsdb* for wmcs-admin users - https://phabricator.wikimedia.org/T178128#3716479 (10madhuvishy) @Marostegui Yeah that sounds right to me! Cool if I run that across the wiki replicas? [17:10:10] And I woke up with a headache anyway, no need to make it worse :( [17:10:23] RECOVERY - High lag on wdqs1005 is OK: OK: Less than 30.00% above the threshold [600.0] [17:12:53] RECOVERY - High lag on wdqs1004 is OK: OK: Less than 30.00% above the threshold [600.0] [17:13:23] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0] [17:14:22] 10Operations, 10Ops-Access-Requests, 10DBA, 10cloud-services-team (Kanban): Access to raw database tables on labsdb* for wmcs-admin users - https://phabricator.wikimedia.org/T178128#3716482 (10Marostegui) Maybe only labsdb1010 (sby host) and check that it does what is required and needed. [17:15:45] (03PS1) 10Legoktm: admin: Add legoktm to releasers-mediawiki to upload wikidiff2 tarballs [puppet] - 10https://gerrit.wikimedia.org/r/386896 [17:18:15] no_justification: bblack hey, sorry, we had a fish death this morning that delayed me getting online. [17:18:20] (03PS1) 10Herron: puppet: update hiera function call in openldap labs-acls template [puppet] - 10https://gerrit.wikimedia.org/r/386897 (https://phabricator.wikimedia.org/T179176) [17:18:36] MediaWiki.wikibase.quality.constraints.sparql.timing.rate [17:18:52] Up to 300 sparql queries/ min from the extension [17:19:50] Average over 100 still looks pretty bad… especially around the times it made problems [17:20:01] movingAverage(MediaWiki.wikibase.quality.constraints.sparql.timing.rate,50) [17:20:02] I mean [17:21:05] (03CR) 10Herron: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/8520/" [puppet] - 10https://gerrit.wikimedia.org/r/386897 (https://phabricator.wikimedia.org/T179176) (owner: 10Herron) [17:22:52] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Puppet4: update hiera function call in openldap labs-acls.erb template - https://phabricator.wikimedia.org/T179176#3716508 (10herron) [17:22:58] 10Operations, 10Puppet, 10User-Joe: Puppet4: hiera() can only be called using the 4.x function API. 
- https://phabricator.wikimedia.org/T179181#3716511 (10herron) [17:23:02] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3716512 (10herron) [17:23:04] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Puppet4: update hiera function call in openldap labs-acls.erb template - https://phabricator.wikimedia.org/T179176#3715745 (10herron) 05Open>03Resolved a:03herron [17:23:24] hm, but it also doesn't look worse than it used to look [17:23:25] 10Operations, 10Puppet, 10User-Joe: Puppet4: update hiera function call in openldap labs-acls.erb template - https://phabricator.wikimedia.org/T179176#3715745 (10herron) [17:23:40] so the rate of queries is nothing new apparently [17:24:36] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3584101 (10Halfak) Please do not wipe these disks until {T179189} is resolved. [17:25:23] PROBLEM - puppet last run on labtestneutron2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:25:50] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3716518 (10Ottomata) @Cmjohnson it is looking like there was some files that @halfak needs that I did not fully sync over from stat1003 to stat1006. I'm going to try to power stat1003... [17:26:35] 10Operations, 10Ops-Access-Requests, 10DBA, 10cloud-services-team (Kanban): Access to raw database tables on labsdb* for wmcs-admin users - https://phabricator.wikimedia.org/T178128#3716537 (10madhuvishy) @Marostegui Sounds good, thanks [17:26:53] (03PS1) 10Herron: puppet: update hiera function call in thumbor nginx template [puppet] - 10https://gerrit.wikimedia.org/r/386899 (https://phabricator.wikimedia.org/T179178) [17:34:01] bblack: no_justification: Would it be ok to re-enable the feature on mwdebug1001 or 2 only so that I can manually test there? [17:36:27] greg-g: ^ [17:38:51] 10Operations, 10Ops-Access-Requests, 10DBA, 10cloud-services-team (Kanban): Access to raw database tables on labsdb* for wmcs-admin users - https://phabricator.wikimedia.org/T178128#3716553 (10madhuvishy) Cool, I've run ``` GRANT SELECT, SHOW VIEW ON `%wik%`.* TO 'labsdbadmin'@'10.64.37.19'; GRANT SELECT,... [17:41:04] :/ [17:41:17] greg-g: TL;DR is https://phabricator.wikimedia.org/T179156 [17:41:50] (03CR) 10Herron: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/8521/" [puppet] - 10https://gerrit.wikimedia.org/r/386899 (https://phabricator.wikimedia.org/T179178) (owner: 10Herron) [17:42:07] we've debated when/what to revert. As it turned out, the stuff that seems to stabilize things last night wasn't all of it. We had some recurrence this morning, so hoo disabled the whole feature and that seems to have stabilized things again. [17:42:16] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3716559 (10Cmjohnson) @ottomata okay please update task when it's okay to wipe. Thanks [17:42:28] "the feature" being the wikibase constraints thing, I think? 
[17:42:44] which is what hoo's asking to test on mwdebug now too [17:43:04] ah, yes [17:43:27] I'd re-enable it there then try to produce some load to these APIs [17:43:44] and see how that's going [17:43:50] 10Operations, 10Ops-Access-Requests, 10DBA, 10cloud-services-team (Kanban): Access to raw database tables on labsdb* for wmcs-admin users - https://phabricator.wikimedia.org/T178128#3716562 (10bd808) ``` $ ssh labstore1004.eqiad.wmnet $ mysql -h labsdb1010.eqiad.wmnet -u labsdbadmin -p Enter password: (lab... [17:44:10] 10Operations, 10Puppet, 10User-Joe: Puppet4: update hiera function call in thumbor nginx template - https://phabricator.wikimedia.org/T179178#3716563 (10herron) [17:44:15] 10Operations, 10Puppet, 10User-Joe: Puppet4: hiera() can only be called using the 4.x function API. - https://phabricator.wikimedia.org/T179181#3716567 (10herron) [17:44:19] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3716568 (10herron) [17:45:03] it seems to me like there has to be a deeper answer than "things got slower" though. when things get slower, it shouldn't tear everything else apart. maybe it's the kind of thing where queries that can be very slow should only be done asynchronously.... or maybe it's the kind of thing where they need to be split from sharing connection resources and such with other presumably-normally-fast traffic that's unrelated? [17:46:01] I guess what I'm trying to say is that if the end-fix is "speed up the queries in question", I think we still have problems looming in the future. [17:48:35] or maybe another way to state it is that this seems more like a design problem than a performance problem [17:49:21] bblack: No, query speed should not be the issue here [17:49:28] it might be the concrete cause [17:49:35] but I think we have something harder at hand [17:49:48] And that I would like to see [17:51:41] I'm fine with re-enabling on mwdebug for testing [17:51:51] maybe for the text-cluster or the mw-and-other-related-services level, we should have some hard cap on request execution time, to avoid stalling out other requests that share resources, too [17:52:25] I'm trying to remember g.wicke's rant on this topic (timeouts at various levels) [17:52:31] I don't know where or how or what it would look like. but e.g. hypothetically anytime varnish asks mediawiki a question, if it doesn't get a full answer in less than X time, it just aborts and sends a 503 to the client. [17:52:54] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3716585 (10herron) [17:52:55] I think someone (maybe not g.wicke) has expressed opposite opinions on that topic though. [17:52:56] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Puppet4: Evaluation Error: Error while evaluating a Function Call, Could not autoload puppet/parser/functions/ordered_yaml: cannot load such file -- puppet/util/zaml.rb - https://phabricator.wikimedia.org/T179076#3716583 (10herron) 05Open>03Resolve... [17:53:16] (that we've already paid the cost for the slow query on the backside, may as well not abort and leave it lingering?) 
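A toy sketch of the asynchronous alternative floated above: the request path only ever reads a cached constraint-check result, and a separate worker talks to the slow SPARQL backend, so a slow query degrades freshness rather than request latency. In MediaWiki terms this would presumably be a deferred update or job-queue mechanism; the names and structure here are purely illustrative, not how the extension actually works:

```python
"""Illustrative only: serve cached constraint results synchronously,
refresh them asynchronously so SPARQL slowness never blocks a web request."""
import queue
import threading
import time

cache = {}                      # entity id -> (timestamp, result)
refresh_queue = queue.Queue()   # entity ids waiting for a background refresh
CACHE_TTL = 300                 # seconds; illustrative value

def check_constraints_cached(entity_id):
    """Called in the web request path: never blocks on SPARQL."""
    entry = cache.get(entity_id)
    if entry is None or time.time() - entry[0] > CACHE_TTL:
        refresh_queue.put(entity_id)        # ask a worker to (re)compute
    return entry[1] if entry else {"status": "pending"}

def run_sparql_check(entity_id):
    """Placeholder for the slow part: the actual SPARQL-backed check."""
    time.sleep(2)                           # pretend the query is slow
    return {"status": "ok", "entity": entity_id}

def refresh_worker():
    """Runs outside the request path; slowness here hurts freshness, not latency."""
    while True:
        entity_id = refresh_queue.get()
        cache[entity_id] = (time.time(), run_sparql_check(entity_id))
        refresh_queue.task_done()

threading.Thread(target=refresh_worker, daemon=True).start()

if __name__ == "__main__":
    print(check_constraints_cached("Q42"))  # first call: pending, refresh queued
    time.sleep(3)
    print(check_constraints_cached("Q42"))  # later call: served from cache
```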
[17:54:14] !log Taking mwdebug1001 to do tests regarding T179156 [17:54:15] from the varnish pov, though, such a limitation is a sanity-constraint that a single query can't take more than X timeslice from the limited reasonable count of connections to mediawiki, and thus can't contribute over-much to starving other parallel queries [17:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:21] T179156: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156 [17:54:42] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3668818 (10herron) [17:55:03] (and if some type of request is commonly hitting that timeslice limit, it's probably a candidate to move to some other asynchronous mechanism to make the immediate request results faster?) [17:55:24] RECOVERY - puppet last run on labtestneutron2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:55:38] I donno, I'm just rambling about things at this point :P [17:55:48] bblack: :) :) [17:57:31] probably an interesting question to know the answer to, is whether MW (or other services) do actively abort everything they're doing on behalf of a request when the requestor closes the connection abruptly (because it wouldn't tolerate the long execution time) [17:58:23] that'd be a good thing to know before we made any newer choice on the previous question [17:59:03] tim had a lot of things to say in the past on this topic too, during that whole RFC about related things a while back [18:00:19] https://phabricator.wikimedia.org/T97204 [18:01:22] but I think the answers differ a little bit if we're not contemplating a retry, just an abort [18:04:24] my strawman would be every layer of code/service/whatever involved in requests should always just fail instead of retry, when it detects a lower-level failure in a subrequest. I guess now I'm adding to that "and should also fail-but-not-retry if the subrequest takes more than X ms" (some reasonable cutoff for interactive requests). [18:04:36] bblack: Anything interesting in the varnishes? [18:04:49] and then there's this one retry we still have at the very very outer edge, where the first varnish the user comes into contact with will retry a failed request back through the whole stack once-only. [18:05:13] hoo: not statistically [18:05:24] Ok, I'm also bypassing varnish [18:05:32] well directly hitting mwdebug1001 now [18:05:41] but that one goes through misc varnish to call to sparql [18:07:19] 10Operations, 10Ops-Access-Requests, 10DBA, 10cloud-services-team (Kanban): Access to raw database tables on labsdb* for wmcs-admin users - https://phabricator.wikimedia.org/T178128#3716626 (10madhuvishy) I've now rolled this out to labsdb10[01|03|09|10|11]. @Marostegui Is there a file/config/logs somewher... [18:09:04] (03CR) 10Chad: "As owner of this service, I approve :)" [puppet] - 10https://gerrit.wikimedia.org/r/386896 (owner: 10Legoktm) [18:10:07] I'm putting some stress on mwdebug1001 now through varnish [18:10:10] watch out [18:11:25] I still don't really see any ill effects [18:11:42] you're using X-Wikimedia-Debug for queries to wikidata.org via varnish I assume? 
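For reference, a rough sketch of the kind of stress test being run here: a batch of constraint-check API requests to wikidata.org through the text caches, pinned to one debug appserver via X-Wikimedia-Debug, with random garbage appended so varnish cannot answer from cache. The header value and the API parameter names are assumptions; check the X-Wikimedia-Debug documentation on wikitech and the module help before reusing this:

```python
#!/usr/bin/env python
"""Fire a small batch of cache-busted API requests pinned to a debug backend.
Header syntax and parameter names are assumptions."""
import random
import string
import time
from concurrent.futures import ThreadPoolExecutor

import requests

API = "https://www.wikidata.org/w/api.php"
HEADERS = {
    # Assumed syntax for routing to a specific debug backend.
    "X-Wikimedia-Debug": "backend=mwdebug1001.eqiad.wmnet",
    "User-Agent": "constraint-check load test (ops debugging)",
}

def one_request(entity_id):
    # Random garbage query parameter to defeat caching and spread chash.
    buster = "".join(random.choice(string.ascii_lowercase) for _ in range(12))
    params = {
        "action": "wbcheckconstraints",  # module name as mentioned in the log
        "id": entity_id,                 # assumed parameter name
        "format": "json",
        "cachebuster": buster,
    }
    start = time.time()
    resp = requests.get(API, params=params, headers=HEADERS, timeout=30)
    return resp.status_code, time.time() - start

if __name__ == "__main__":
    items = ["Q{}".format(random.randint(1, 100000)) for _ in range(50)]
    # Keep parallelism well under the small X-Wikimedia-Debug connection limit.
    with ThreadPoolExecutor(max_workers=10) as pool:
        for status, elapsed in pool.map(one_request, items):
            print(status, "{:.2f}s".format(elapsed))
```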
[18:11:54] PROBLEM - Nginx local proxy to apache on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:12:02] there's a very small connection limit for X-Wikimedia-Debug btw [18:12:08] bblack: Yeah :/ [18:12:08] Ah I see [18:12:19] but still this is hitting wdqs quite a lot [18:12:26] but hey you managed to blow up mwdebug1001 :) [18:12:29] (apparently) [18:12:44] RECOVERY - Nginx local proxy to apache on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 0.045 second response time [18:13:12] appservers_debug (from varnish) has a connection limit of 20 (same limit that's usually 1K, now 10K, on prod hitting the LVS in front of all N appservers) [18:13:47] it can stuff many fast queries through those connections, but once your parallelism is approaching 20 you'll probably hit that limit [18:14:36] (unless your query URLs are sufficiently chash-diverse to use many varnish backends. then you might get up to ~8*20 as a limit) [18:15:59] Well, I append random garbage [18:16:08] Running another batch now [18:17:05] 10Operations, 10Ops-Access-Requests, 10DBA, 10cloud-services-team (Kanban): Access to raw database tables on labsdb* for wmcs-admin users - https://phabricator.wikimedia.org/T178128#3716639 (10Marostegui) I don't think we are tracking the GRANTS anywhere in puppet, but I cannot check now (I am with my phon... [18:18:40] !log awight@tin Started deploy [ores/deploy@185170f]: Test pip-9 scap trick on ores1002 (non-production) [18:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:58] !log awight@tin Finished deploy [ores/deploy@185170f]: Test pip-9 scap trick on ores1002 (non-production) (duration: 02m 17s) [18:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:03] PROBLEM - Nginx local proxy to apache on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:48] All of this looks good to me [18:21:53] RECOVERY - Nginx local proxy to apache on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 0.054 second response time [18:22:00] Can't attach strace to hhvm :/ [18:22:09] But not sure whether that would be interesting even [18:23:57] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10Patch-For-Review, and 2 others: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3716644 (10greg) 05Open>03Resolved a:03greg Well, let's close this and make a follow-up task. 
[18:24:25] A "sudo strace -e trace=network -p 29461 > /home/hoo/hhvm.net-strace.log" would be nice when I run my next batch… possibly [18:24:30] might also be just stupidly boring [18:26:28] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10Release-Engineering-Team (Kanban), and 2 others: Investigate what caused the the unattended varnish upgrade in Beta Cluster - https://phabricator.wikimedia.org/T179197#3716649 (10greg) [18:26:58] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10Release-Engineering-Team (Backlog), 10User-greg: Investigate what caused the the unattended varnish upgrade in Beta Cluster - https://phabricator.wikimedia.org/T179197#3716668 (10greg) [18:27:30] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10Release-Engineering-Team (Backlog): Investigate what caused the the unattended varnish upgrade in Beta Cluster - https://phabricator.wikimedia.org/T179197#3716670 (10greg) a:05greg>03None [18:28:07] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10Patch-For-Review, and 2 others: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704733 (10greg) a:05greg>03hashar [18:28:57] hoo: are those SPARQL request synchronous ? [18:29:09] hoo: I mean from a user perspective? [18:29:24] Yeah [18:29:42] they're very fast [18:29:45] but still [18:29:59] that's sounds like a pretty dangerous thing to do. Response times from WDQS are far from stable [18:30:39] No news for me… [18:30:47] they might be fast in the general case, but there is a good chance that a lot of them are slow [18:31:14] * gehel is worried of edqs in its current form being used for too much synchronous workflows [18:31:20] s/edqs/wdqs/ [18:31:33] +1 [18:31:54] I didn't know about this until a few hours ago either :/ [18:33:16] re: we've debated when/what to revert. As it turned out, the stuff that seems to stabilize things last night wasn't all of it. We had some recurrence this morning, so hoo disabled the whole feature and that seems to have stabilized things again. [18:33:31] hoo: so, the code was hit on other wikis too, right? [18:33:58] ori: No, still only Wikidata [18:34:05] we probably need to rais priority on T178492 and start addressing that soon [18:34:05] T178492: Create a more controlled WDQS cluster - https://phabricator.wikimedia.org/T178492 [18:34:08] but it's hitting misc-web down the line [18:34:16] or put differently: reverting the change rather than just rolling back wikidata would have disabled it completely? [18:34:34] No, but we could have switched off the feature completely [18:34:57] ori: I think the change in question was the sorting-change, the "disable the feature" was to disable the whole extension that contained the sorting change [18:35:19] The extension is still there, but it's only doing very basic checks now (w/o calling out to wdqs) [18:35:23] oh ok [18:35:38] but it has been around like that for at least 2 years now [18:35:42] yeah so, to echo what gehel is saying.... [18:35:43] so that's nothing surprising [18:36:05] anything in the flow of live user requests "calling out to wdqs" sounds scary and sounds like it could cause what we're seeing potentially heh [18:36:27] Yeah :/ [18:38:30] I am still puzzled. The deployment of the CLs that I suggested be reverted was very closely correlated with a spike in database activity. You're saying that reverting wikidata to wmf4 did not completely fix the issue. So was the spike not related to the deployment, or was there some bit of state that persisted after the revert? 
[18:39:37] ori: Well… this is complicated [18:39:44] I can try [18:39:48] the revert might not have helped, but we don't know for sure [18:39:49] lol [18:40:08] so, I think this "feature" went live on wikidata with the rollout of wmf.5, which happened a day before the incident [18:40:18] It did [18:40:30] and the feature's rollout created the possibility of bad problems, but they hadn't really struck yet [18:40:31] but some parts of this have been live for longer than that [18:40:38] yeah [18:40:47] when the problem hit yesterday, those specific sort slowdowns exacerbated it and made it more obvious [18:41:12] but the revert to wmf.4 apparently didn't undo everything, because yeah some parts of this have been live for longer than that [18:41:17] what was the problem and when did it hit? [18:41:47] but clearly, there's a lot of open wtf-??? in all of this [18:41:52] * gehel needs to go away... but ping me if needed, I'm not far... [18:42:34] I'm not constantly poking the maybe relevant API on mwdebug1001 (bypassing varnish) from terbium. One request in serial. [18:42:35] ori: the user-facing part was that we had a rash of 503s coming out of varnish. that turns out to be because varnish hit its 1K-parallel-conns limit to MediaWiki (when the number is usually more like 100). [18:42:52] https://phabricator.wikimedia.org/T179156 [18:42:55] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5&from=now-24h&to=now [18:43:26] bblack: if you believe that we cannot use the query service internally (blocking during api requests) in general, please write an email to (at least) me and lydia. So far, there were no indications that this was a problem. If it is, this seriously screws with our plans. I see no good alternative atm. [18:43:47] bblack: is there concrete evidence that internal queries to wdqs *are* a problem? [18:44:30] I'm still confused. *Something* happened around 2017-10-26 18:50 that caused some database workloads to spike. What is it? [18:44:31] DanielK_WMDE: my understanding from the past (could be wrong!) is that wdqs is a research tool. that if we were going to use the results of sparql queries to service production queries, we'd process those into results offline and use them another way, not directly through the wdqs raw sparql interface as a sub-query. [18:45:08] ori: most-probably, the rollout of that sorting change. there's maybe a minor chance that the process of code rollout there itself was a problem instead of the sort change, but I'd assume the sort change. [18:45:14] bblack: no, it was never intended as a research tool. it's a public query api. [18:45:34] ok, I think we have a semantics issue here, let's re-phrase... [18:45:59] I know it's a public query api in the sense that the public can reach https://query.wikidata.org/ and execute queries [18:46:07] bblack: the idea was indeed to not run all kinds of queries live (for list generation), but do this offline and cache [18:46:21] ori: The database workloads didn't change [18:46:26] the current use is a kind of hybrid: it's used to answer certain API requests [18:46:39] I beg to differ, hoo [18:46:40] I think what I mean is, we know some of those queries can be slow/problematic. I don't think the intent was to have live wiki pageviews hitting wdqs indirectly and executing arbitrary sparql in realtime? 
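The arithmetic behind that connection blow-up is worth making explicit: at a roughly constant request rate, in-flight connections scale with mean backend latency (Little's law), so a small slice of slow requests can dominate the pool. A toy calculation with made-up rate and latency figures; only the ~100 steady-state connections per varnish and the old 1K sanity limit come from the discussion above:

```python
"""Back-of-the-envelope illustration of how a slow traffic slice inflates
the varnish->MediaWiki connection count. All rates/latencies are made up."""

def in_flight(rate_rps, latency_s):
    # Little's law: average in-flight requests = arrival rate * mean latency.
    return rate_rps * latency_s

# Assumed normal day: ~500 req/s per varnish at ~0.2s mean backend latency.
normal = in_flight(500, 0.2)            # = 100 connections in flight

# Now assume 10% of that traffic starts taking ~30s (e.g. waiting on a slow
# or timing-out subrequest) while the other 90% stays fast.
degraded = in_flight(500 * 0.9, 0.2) + in_flight(500 * 0.1, 30.0)

print("steady state: ~{:.0f} in-flight connections".format(normal))
print("with a slow 10% slice: ~{:.0f} in-flight connections".format(degraded))
print("exceeds the old 1K sanity limit: {}".format(degraded > 1000))
```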
[18:46:52] Well, they did… but this was most likely a symptom [18:47:10] bblack: the pageviews themselves are not. but api queries triggered by page views are [18:47:16] well ok [18:47:49] either way, if SPARQL's getting executed synchronously to (indirectly) service a user pageview, I think that's probably a problem. [18:47:54] in particular, these are not triggered by purges, and only a small number of people have this feature enabled [18:48:15] hoo: https://i.imgur.com/RQYmGT5.png [18:48:26] lines up pretty nicely with the deployment and the revert [18:48:35] bblack: it's only done for *some* page views, should be relatively few (hoo should know how many) [18:48:58] the idea was to introduce caching before exposing it to anons [18:49:09] ori: But also with the usual day to day query pattern [18:49:33] bblack: but if this is ALWAYS a problem and needs to be killed *now*, then please make a ticket for it documenting the problem, and send mail to lydia and me. [18:50:21] I'm not even sure what you mean by "ALWAYS a problem and needs to be killed now" [18:50:38] ori: that's interesting. that would point to a problem that has nothing to do with sparql, right? [18:51:02] DanielK_WMDE: I have no idea, I'm not talking about SPARQL [18:51:21] ori: ok - bblack is, so i got confused :) [18:51:22] DanielK_WMDE: bblack was thinking through implications, no decisions have been made yet, we're still investigating deeper causes :) [18:51:27] though: putting any external service in the critical path of mediawiki requests (or *any* subset thereof) couples services together and creates the potential for cascading failures. In the past it was generally understood that if you're doing that, the onus is on you to demonstrate that it is safe by showing how failures of the service won't cascade to mediawiki [18:52:42] ori: well, the consequence would be that a specific kind of API request would take long to respond (as long as the SPARQL timeout) [18:53:15] DanielK_WMDE: yeah but all the API requests share mediawiki servers and instances, and a pool of connections from varnish. one small type slowing down can take everything else with it. [18:53:55] bblack: yes. but we have quite a few api actions that can potentially take long, right? [18:54:07] one thing that we can probably do right away is force a low timeout on such queries [18:54:11] I couldn't say, I'm not that versed in the API :) [18:54:45] DanielK_WMDE: yes, and that has been identified as a risk to reliability. the timeouts have been slashed pretty aggressively in the last couple of years IIRC [18:54:58] what I can say is that normally, our varnish instances have a fairly steady-state count of parallel connections to MediaWiki, at ballpark 100/varnish-server. We had a sanity-limit set at 1K, and this tripped the sanity limit and started 503ing all other random mediawiki requests coming through the cache. [18:54:58] Ah - hoo says that we are using a query timeout of 5 seconds, and http timeout after 6 seconds [18:55:03] that *should* prevent this [18:55:18] we raised the limit to 10K as a workaround and the connections settled in at ~3K [18:55:33] so, whatever changed here, it caused concurrency from varnish->mediawiki to go up by 30x overall [18:55:36] bblack: is it possible to dump all query strings when the limit is hit? that would give us valuable information. [18:56:32] once the limit is hit, it's like finding needles in haystacks. 
a random subset of all user-facing queries are failing [18:56:59] with the sanity-limit out of the way, most requests succeed (although the connection load causes long-term issues for the varnishd) [18:57:32] I did a lot of live debugging during that last night, but I was never able to pin down a concrete, repeatable request that was some kind of "ah-hah, it's this exact request that kills things" [18:57:34] the apache log line format includes backend response latency. you could extract log lines from the time period of interest and aggregate. I've done that in the past. There are probably smarter ways of doing it now [18:58:13] Hadoop has that as well… we looked at some already [18:58:15] bblack: well, if some kind of request takes too long and hogs the connection pool, then many instances of that request should be in the dump, no? [18:58:16] but I spent a lot of that time looking for completely-failing/timeout requests, I didn't realize until later that it might just be quite-slow-but-successful requests clogging the connection pipeline [18:58:18] but nothing overly interesting [18:58:26] *sigh* we really need better stats [18:58:29] * ori -> away [18:58:54] is that ever a false statement? :) [18:58:59] do we have stats about which api module takes how long on avg and max? [19:00:07] ok, hoo and i are late for the wikidatacon opening party. and it seems like all we can do is speculate. [19:00:16] I don't know if there's a dashboard, but the data is there. see https://grafana.wikimedia.org/dashboard/db/api-requests?refresh=5m&orgId=1 [19:00:31] ori: thank you! [19:01:22] huh... that seems to have a lot of modules, but no wbcheckconstraints [19:01:27] that's odd [19:02:19] anyway... i'm getting the feeling that the api-module-hitting-sparql may be a red herring... [19:03:11] bblack, greg-g: thank you for looking into this. please let us know if we can do anything to help with the investigation. [19:03:40] of course, enjoy wikidatacon [19:04:23] bblack, greg-g: so far, it seemed that calling another service internally should be fine, as long as we set a low http timeout - which we did. if this assumption is wrong, please tell us asap. That would mean re-engineering a few features. [19:04:31] that would suck, but i'd like to know asap. [19:04:44] PROBLEM - puppet last run on mc2028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:04:52] it's early to be making any hard recommendations. I think we still don't have a complete handle on wtf is really happening. [19:05:15] DanielK_WMDE: don't jump to conclusions yet :) [19:06:16] don't hate me, but https://landing.google.com/sre/book/chapters/addressing-cascading-failures.html is pretty good [19:06:23] ™ [19:06:33] 26 17 94.59826500000003 117 [19:06:33] 26 18 238.81818199999998 229 [19:06:34] 26 19 304.95121600000004 329 [19:06:36] ori: we'll always love you ;) [19:06:55] day, hour, sum(time to first byte) and no. of requests [19:07:00] to the checkconstraints api [19:07:04] pretty boringly low [19:08:07] that's nothing. i don't see how that could possibly cause anything to back up... [19:08:13] anyway. 
ttfn [19:08:53] leaving now as well :/ [19:08:59] Sorry, but I don't see anything to do here [19:09:01] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): upload problem for parsoid release - https://phabricator.wikimedia.org/T179134#3716774 (10greg) [19:09:32] !log Ran scap pull on mwdebug1001 [19:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:21] 10Operations, 10ops-ulsfo, 10Traffic: setup/deploy wmf741[56] - https://phabricator.wikimedia.org/T179204#3716775 (10RobH) [19:10:39] You can call/ ping me up on my phone if needed [19:11:26] 10Operations, 10ops-ulsfo, 10Traffic: setup/deploy wmf741[56] - https://phabricator.wikimedia.org/T179204#3716775 (10RobH) I've assigned this to @bblack for now, as my understanding is he is in charge of the re-structuring of our caching center services across the three misc systems. (This task has two sys... [19:11:52] bblack: ^ so on that im not trying to dodge work, i just have no idea what we want to call them or whats living on each [19:12:01] happy to do installs and help apply role later [19:14:16] robh: ack, it's fine :) [19:14:39] ori: yes, there is so much good stuff in there :) [19:15:18] The "Bimodal latency" section is one of the many that sounds relevant in this scenario heh [19:17:07] it almost perfectly describes what we think was happening, and how the 503s started hitting all other requests [19:17:21] (from a functional outer-layer point of view, anyways) [19:22:23] back on the whole sane-timeout-limits thing. varnish has some configurable timeouts per-backend. but notably they have a connect_timeout (how long we'll stall waiting on a new or freed-up connection to service this request), and a first_byte_timeout (time to first response byte), and a between_bytes_timeout (timeout if no data is flowing in mid-response) [19:22:29] but it has no overall transaction timeout [19:22:51] so if a bad backend trickles in response data at 1 byte-per-second for minutes-long, no timeout will catch it :P [19:23:06] (seems like a silly oversight) [19:25:18] bblack: 503 issue still? [19:25:46] Or did we officially fix it? [19:26:28] we're officially stable last I heard [19:26:43] but there's a lot things up in the air about the incident in general, I think [19:26:50] Okay i wasnt sure is there a task yet ? [19:27:10] https://phabricator.wikimedia.org/T179156 [19:27:16] Thanks... [19:27:40] np! 
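The aggregation ori suggests above (pull the backend response latency out of the apache/MediaWiki access logs and group it per API module) might look roughly like the sketch below; the regex and the microsecond latency field are assumptions about the log format and would need adjusting to the real LogFormat in use. Something like `zcat <api access logs for the incident window> | python3 aggregate_latency.py` would then surface the slow modules.

```python
#!/usr/bin/env python
"""Summarise backend latency per API module from access log lines on stdin.
The field regexes are placeholders; adjust them to the real log format."""
import re
import sys
from collections import defaultdict

# Assumed: a field like "time=123456" (microseconds) and the request URL
# somewhere on the line.
LATENCY_RE = re.compile(r"time=(\d+)")
ACTION_RE = re.compile(r"[?&]action=([a-z]+)")

stats = defaultdict(lambda: [0, 0.0, 0.0])   # action -> [count, total_s, max_s]

for line in sys.stdin:
    lat_match = LATENCY_RE.search(line)
    act_match = ACTION_RE.search(line)
    if not lat_match or not act_match:
        continue
    seconds = int(lat_match.group(1)) / 1e6
    entry = stats[act_match.group(1)]
    entry[0] += 1
    entry[1] += seconds
    entry[2] = max(entry[2], seconds)

# Print modules ordered by total time spent, the ones hogging the pool first.
for action, (count, total, worst) in sorted(stats.items(), key=lambda kv: -kv[1][1]):
    print("%-25s n=%-7d mean=%6.3fs max=%7.3fs" % (action, count, total / count, worst))
```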
[19:30:18] FTR, the applicable settings on the text cluster for all backends are connect_timeout:5s, first_byte_timeout:180s, between_bytes_timeout:60s [19:31:03] (and while we have puppetization to easily control the first two, the last one is at the compiled defaults with no puppetization presently) [19:31:42] I'm thinking maybe we should be able to at least flip a switch pretty quickly to shorten those timeouts dramatically [19:31:57] (thus causing slow-running requests to turn into 503s but letting more of the short requests be successful) [19:33:09] hopefully that would be the fallout anyways, unless MW keeps doing all the behind-the-scenes work for the expensive requests after varnish dumps the connection to abort [19:33:19] I think we should puppetize between bytes before being able to "flick a switch" [19:33:28] yes [19:34:14] the vast majority of requests don't involve anywhere near those kinds of backend latencies [19:34:44] RECOVERY - puppet last run on mc2028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:45:39] (03PS1) 10BBlack: Varnish: puppetize per-backend between_bytes_timeout [puppet] - 10https://gerrit.wikimedia.org/r/386939 (https://phabricator.wikimedia.org/T179156) [19:47:42] (03CR) 10BBlack: [C: 032] Varnish: puppetize per-backend between_bytes_timeout [puppet] - 10https://gerrit.wikimedia.org/r/386939 (https://phabricator.wikimedia.org/T179156) (owner: 10BBlack) [20:01:31] (03CR) 10Aaron Schulz: [C: 031] role::mediawiki::jobrunner: inc runners for refreshLinks/htmlCacheUpdate [puppet] - 10https://gerrit.wikimedia.org/r/386636 (https://phabricator.wikimedia.org/T173710) (owner: 10Elukey) [20:24:49] 10Operations, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): nutcracker fails to start due to lack of /var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures) - https://phabricator.wikimedia.org/T178457#3716891 (10hashar) 05Open>03Resolved a:03... [20:54:25] !log running migratePreferences.php on group2 wikis [20:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:04] 10Operations, 10ORES, 10Scoring-platform-team, 10Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3717006 (10greg) [22:06:54] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0 [22:28:23] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:28:23] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:28:24] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:28:44] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:28:53] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:29:03] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:30:23] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:30:29] o.O [22:30:47] something wrong with stat1005? 
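To make the "no overall transaction timeout" point concrete: with between_bytes_timeout at 60s, a backend trickling one byte every 59 seconds never trips any of the three per-phase timeouts, so a single transaction can occupy a connection for an absurdly long time. A small client-side sketch (generic socket-level behaviour, not the actual varnish code) of what an overall deadline would add:

```python
"""Illustration: a per-read (between-bytes) timeout bounds each read, not the
whole transaction; only an explicit overall deadline bounds total duration."""
import time

PER_READ_TIMEOUT = 60        # seconds, like between_bytes_timeout:60s
OVERALL_DEADLINE = 180       # what a whole-transaction timeout would enforce

def read_with_deadline(read_chunk, deadline=OVERALL_DEADLINE):
    """read_chunk() returns the next chunk or b'' at EOF, and is assumed to
    raise on its own if no byte arrives within PER_READ_TIMEOUT."""
    start = time.time()
    body = b""
    while True:
        chunk = read_chunk()
        if not chunk:
            return body
        body += chunk
        # A slow trickle never trips the per-read timeout; the explicit
        # overall deadline is what finally aborts the transaction.
        if time.time() - start > deadline:
            raise TimeoutError("transaction exceeded %ds deadline" % deadline)

# Worst case without a deadline: a 1 MB body trickled at 1 byte every
# 59 seconds stays under the between-bytes timeout the whole way.
print("trickle worst case for a 1 MB body: ~{:.0f} days".format(
    1000000 * (PER_READ_TIMEOUT - 1) / 86400.0))
```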
[22:33:23] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [22:33:24] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [22:33:24] RECOVERY - Disk space on stat1005 is OK: DISK OK [22:33:44] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [22:33:53] RECOVERY - DPKG on stat1005 is OK: All packages OK [22:34:03] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [22:35:23] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures