[00:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171128T0000). [00:00:05] kaldari: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:01:17] o/ [00:01:30] (03CR) 10Kaldari: [C: 032] Enable MP3 uploads on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393661 (https://phabricator.wikimedia.org/T120288) (owner: 10Kaldari) [00:02:06] kaldari: did you see my email? [00:02:17] no :P [00:02:25] emails [00:02:28] from 5 days ago [00:03:02] how should I contact you with questions regarding a planned deployment that I don't want to ask on task? [00:03:06] (03CR) 10Kaldari: [C: 04-2] Enable MP3 uploads on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393661 (https://phabricator.wikimedia.org/T120288) (owner: 10Kaldari) [00:03:17] greg-g: Sorry, I was on vacation. I see it now. [00:03:23] greg-g: postponing [00:03:33] Thank you. [00:04:57] (03PS1) 10Krinkle: Gerrit: Fix negative cut-off logo on page load [puppet] - 10https://gerrit.wikimedia.org/r/393691 [00:05:09] (03PS2) 10Dzahn: ganeti: remove Ganglia [puppet] - 10https://gerrit.wikimedia.org/r/393683 (https://phabricator.wikimedia.org/T177225) [00:25:05] RECOVERY - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is OK: TCP OK - 0.000 second response time on 10.64.0.230 port 9042 [00:29:56] (03CR) 10Dzahn: "@akosiaris: I made a Grafana dashboard for ganeti (eqiad)" [puppet] - 10https://gerrit.wikimedia.org/r/393683 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [00:32:03] (03CR) 10Thcipriani: [C: 031] "Much better <3" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393691 (owner: 10Krinkle) [00:34:18] (03PS1) 10Madhuvishy: public_dumps: Add puppet class to set up NFS for dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/393695 (https://phabricator.wikimedia.org/T181431) [00:35:04] (03PS2) 10Madhuvishy: public_dumps: Add puppet class to set up NFS for dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/393695 (https://phabricator.wikimedia.org/T181431) [00:38:33] (03CR) 10Madhuvishy: [C: 032] public_dumps: Add puppet class to set up NFS for dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/393695 (https://phabricator.wikimedia.org/T181431) (owner: 10Madhuvishy) [00:38:45] (03PS1) 10Dzahn: grafana: add dashboards for ganeti [puppet] - 10https://gerrit.wikimedia.org/r/393696 [00:43:40] (03PS2) 10Dzahn: grafana: add dashboards for ganeti [puppet] - 10https://gerrit.wikimedia.org/r/393696 (https://phabricator.wikimedia.org/T177225) [00:44:50] (03PS3) 10Dzahn: ganeti: remove Ganglia [puppet] - 10https://gerrit.wikimedia.org/r/393683 (https://phabricator.wikimedia.org/T177225) [00:45:51] (03CR) 10Dzahn: [C: 032] "https://gerrit.wikimedia.org/r/#/c/393696/" [puppet] - 10https://gerrit.wikimedia.org/r/393683 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [00:49:12] (03CR) 10Dzahn: "this removed /etc/ganglia/, /usr/lib/ganglia/, package libganglia (expected) and also "/etc/systemd/system/ganglia-monitor-aggregator@.ser" [puppet] - 10https://gerrit.wikimedia.org/r/393683 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [00:49:23] (03PS2) 10Krinkle: Gerrit: 
Fix negative cut-off logo on page load [puppet] - 10https://gerrit.wikimedia.org/r/393691 [00:59:42] (03CR) 10Kaldari: Enable MP3 uploads on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393661 (https://phabricator.wikimedia.org/T120288) (owner: 10Kaldari) [01:00:45] (03PS1) 10Dzahn: grafana: add dashboard for cloud [puppet] - 10https://gerrit.wikimedia.org/r/393698 [01:37:06] (03CR) 10Paladox: "Hi, is this meant to have user interface changes?" [puppet] - 10https://gerrit.wikimedia.org/r/393691 (owner: 10Krinkle) [01:41:07] (03CR) 10Krinkle: "No, it is meant to look the same. However, you should notice an improvement with how it loads." [puppet] - 10https://gerrit.wikimedia.org/r/393691 (owner: 10Krinkle) [01:43:22] (03CR) 10Paladox: [C: 031] "Oh i see. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/393691 (owner: 10Krinkle) [01:53:05] (03CR) 10Thcipriani: [C: 04-1] "Decided to try this in the new Firefox and noticed some weird behavior. Could be a problem with how I was testing (just using developer co" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393691 (owner: 10Krinkle) [01:55:11] (03PS3) 10Krinkle: Gerrit: Fix negative cut-off logo on page load [puppet] - 10https://gerrit.wikimedia.org/r/393691 [01:55:23] (03CR) 10Thcipriani: [C: 04-1] ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393691 (owner: 10Krinkle) [01:56:23] (03CR) 10Krinkle: "Fixed. Added the correct min-width for 60px which matches the image." [puppet] - 10https://gerrit.wikimedia.org/r/393691 (owner: 10Krinkle) [01:56:46] (03CR) 10Thcipriani: [C: 031] Gerrit: Fix negative cut-off logo on page load [puppet] - 10https://gerrit.wikimedia.org/r/393691 (owner: 10Krinkle) [02:20:37] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.8) (duration: 05m 23s) [02:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:25:39] 10Operations, 10Ops-Access-Requests, 10Gerrit: Access to logstash (LDAP group 'nda') for Paladox - https://phabricator.wikimedia.org/T181446#3790664 (10Legoktm) I think there's probably other ways we can help Paladox contribute in this area that don't require the nda access - is this just about wanting to vi... [03:24:05] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 726.83 seconds [03:31:09] !log labservices1001/1002,labtestservices2001 - remove pdns_gmetric cronjobs causing cron spam after ganglia decom from lab* [03:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:41:38] (03CR) 10Chad: [C: 031] Gerrit: Fix negative cut-off logo on page load [puppet] - 10https://gerrit.wikimedia.org/r/393691 (owner: 10Krinkle) [03:41:48] (03CR) 10Chad: [C: 031] "(can land whenever, doesn't need service restart)" [puppet] - 10https://gerrit.wikimedia.org/r/393691 (owner: 10Krinkle) [03:47:21] 10Operations, 10Ops-Access-Requests, 10Gerrit: Access to logstash (LDAP group 'nda') for Paladox - https://phabricator.wikimedia.org/T181446#3791246 (10demon) >>! In T181446#3791114, @Legoktm wrote: > I think there's probably other ways we can help Paladox contribute in this area that don't require the nda a... 
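A note on the notation above: the stray "03", "04" and "10" digits in entries like "(03CR) 10Kaldari: [C: 032]" are mIRC colour codes from the Gerrit relay bot (grrrit-wm). The colour control bytes were lost when the log was flattened to plain text, leaving only the colour numbers stuck to the words, so "[C: 032]" is a Code-Review +2 vote rendered in green and "[C: 04-1]" is a -1 in red. A minimal sketch of stripping such formatting from raw IRC lines, assuming standard mIRC control bytes (the exact byte-for-byte bot output is an assumption):

```python
import re

# mIRC formatting: \x03 optionally followed by fg[,bg] colour numbers, plus
# bold (\x02), reset (\x0f), reverse (\x16), italics (\x1d), underline (\x1f).
MIRC_FORMAT = re.compile(r'\x03(?:\d{1,2}(?:,\d{1,2})?)?|[\x02\x0f\x16\x1d\x1f]')

def strip_mirc(line: str) -> str:
    """Return an IRC line with mIRC colour/formatting codes removed."""
    return MIRC_FORMAT.sub('', line)

# Illustrative raw line (hypothetical formatting, for demonstration only):
raw = '(\x0303CR\x03) \x0310Kaldari\x03: [C: \x03032\x03] Enable MP3 uploads on Commons'
print(strip_mirc(raw))  # -> (CR) Kaldari: [C: 2] Enable MP3 uploads on Commons
```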
[03:50:05] (03PS4) 10Dzahn: Gerrit: Fix negative cut-off logo on page load [puppet] - 10https://gerrit.wikimedia.org/r/393691 (owner: 10Krinkle) [03:50:24] (03PS1) 10Dzahn: ganglia: ensure more things are gone in decom class [puppet] - 10https://gerrit.wikimedia.org/r/393707 (https://phabricator.wikimedia.org/T177225) [03:50:53] (03CR) 10Dzahn: [C: 032] Gerrit: Fix negative cut-off logo on page load [puppet] - 10https://gerrit.wikimedia.org/r/393691 (owner: 10Krinkle) [03:51:27] (03PS2) 10Dzahn: ganglia: ensure more things are gone in decom class [puppet] - 10https://gerrit.wikimedia.org/r/393707 (https://phabricator.wikimedia.org/T177225) [03:52:33] (03CR) 10Dzahn: "deployed" [puppet] - 10https://gerrit.wikimedia.org/r/393691 (owner: 10Krinkle) [03:55:25] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 295.20 seconds [03:55:45] (03PS3) 10Dzahn: phabricator: Fix elasticsearch version field [puppet] - 10https://gerrit.wikimedia.org/r/393655 (https://phabricator.wikimedia.org/T181437) (owner: 10Paladox) [03:56:30] (03CR) 10Dzahn: [C: 032] phabricator: Fix elasticsearch version field [puppet] - 10https://gerrit.wikimedia.org/r/393655 (https://phabricator.wikimedia.org/T181437) (owner: 10Paladox) [03:57:27] 10Operations, 10Performance-Team, 10Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3791253 (10Imarlier) What's the log format of the message field? I'm guessing the first three fields are start_timestamp, total_time, be_time? [03:58:41] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.4 (duration: 03m 13s) [03:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:03:08] (03CR) 10Dzahn: "*bump* still true about "land immediately"? guess i am back on it then :p" [puppet] - 10https://gerrit.wikimedia.org/r/368547 (owner: 10Paladox) [04:10:42] 10Operations, 10Ops-Access-Requests, 10Gerrit: Access to logstash (LDAP group 'nda') for Paladox - https://phabricator.wikimedia.org/T181446#3791269 (10Dzahn) Confirmed. This was about Gerrit logs. If there was a way to request "logstash but just Gerrit" then that would have been the request. Paladox is the... 
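The dbstore1002 lag alert above goes CRITICAL at roughly 727 seconds and recovers once the reported lag is back under ~300 seconds. A minimal Nagios-style sketch of such a check, assuming PyMySQL is available and reading SHOW SLAVE STATUS; the production check is heartbeat-based (hence the fractional "295.20 seconds") and its real warning/critical thresholds are not visible in this log, so the 300/600 values below are placeholders:

```python
import os
import pymysql  # assumption: PyMySQL is available on the monitoring host

WARN_SECONDS = 300   # placeholder, not the production threshold
CRIT_SECONDS = 600   # placeholder, not the production threshold

def check_slave_lag(host):
    """Return a Nagios-style (exit_code, message) tuple for replication lag."""
    conn = pymysql.connect(host=host,
                           read_default_file=os.path.expanduser('~/.my.cnf'))
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute('SHOW SLAVE STATUS')
            status = cur.fetchone()
    finally:
        conn.close()
    if not status or status['Seconds_Behind_Master'] is None:
        return 3, 'UNKNOWN slave_sql_lag: replication not running'
    lag = int(status['Seconds_Behind_Master'])
    if lag >= CRIT_SECONDS:
        return 2, f'CRITICAL slave_sql_lag Replication lag: {lag} seconds'
    if lag >= WARN_SECONDS:
        return 1, f'WARNING slave_sql_lag Replication lag: {lag} seconds'
    return 0, f'OK slave_sql_lag Replication lag: {lag} seconds'
```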
[04:31:18] (03PS1) 10Dzahn: various misc roles: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/393708 (https://phabricator.wikimedia.org/T177225) [04:37:17] (03PS1) 10Dzahn: rename phabricator_server to just phabricator [puppet] - 10https://gerrit.wikimedia.org/r/393709 [04:39:57] (03PS1) 10Dzahn: rename planet_server to just planet [puppet] - 10https://gerrit.wikimedia.org/r/393710 [04:40:41] (03CR) 10jerkins-bot: [V: 04-1] rename planet_server to just planet [puppet] - 10https://gerrit.wikimedia.org/r/393710 (owner: 10Dzahn) [04:40:58] (03PS1) 10KartikMistry: apertium-crh: New upstream release [debs/contenttranslation/apertium-crh] - 10https://gerrit.wikimedia.org/r/393711 (https://phabricator.wikimedia.org/T181465) [04:43:50] (03PS1) 10Dzahn: rename requesttracker_server to just requesttracker [puppet] - 10https://gerrit.wikimedia.org/r/393712 [04:47:41] (03PS2) 10Dzahn: rename planet_server to just planet [puppet] - 10https://gerrit.wikimedia.org/r/393710 [05:05:08] !log labweb1001/labweb1002: manually purging ganglia package/config/service/unit files because puppet is disabled there (T177225) [05:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:15] T177225: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225 [05:06:56] PROBLEM - SSH on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:07:05] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:07:46] RECOVERY - SSH on scb1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [05:07:56] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [05:11:14] (03PS1) 10Dzahn: silver (wikitech): remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/393714 [05:12:28] (03CR) 10Dzahn: [C: 032] silver (wikitech): remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/393714 (owner: 10Dzahn) [05:21:34] (03CR) 10Legoktm: "Yay! We should probably document in the source maybe where the rss.py file was obtained from. And have we considered asking for it to be p" [puppet] - 10https://gerrit.wikimedia.org/r/392657 (owner: 10Paladox) [05:23:45] !log labtestpuppetmaster2001 - manually purging ganglia things since puppet is disabled [05:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:19] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): username in use when creating account at wikitech - https://phabricator.wikimedia.org/T180813#3791356 (10bd808) The LDAP account for cn=Flominator has a create date of 2010-10-15T00:45:57Z and no `mail` attribute which matches the svn account ex... 
[05:50:47] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): Recover "Flominator" svn account for use as a modern developer account - https://phabricator.wikimedia.org/T180813#3770464 (10bd808) [05:51:40] (03CR) 10Andrew Bogott: "Giuseppe confirms that this change is correct but the comment is almost totally wrong" [puppet] - 10https://gerrit.wikimedia.org/r/393677 (owner: 10Andrew Bogott) [05:52:55] (03PS2) 10Andrew Bogott: profile::puppetmaster::common: Always enable environments [puppet] - 10https://gerrit.wikimedia.org/r/393677 [05:52:57] (03PS2) 10Andrew Bogott: puppetmaster::standalone: include environment env [puppet] - 10https://gerrit.wikimedia.org/r/393678 [06:25:21] (03PS1) 10Marostegui: mariadb: Reimage db1099 [puppet] - 10https://gerrit.wikimedia.org/r/393717 (https://phabricator.wikimedia.org/T178359) [06:27:27] (03PS2) 10Marostegui: mariadb: Reimage db1099 [puppet] - 10https://gerrit.wikimedia.org/r/393717 (https://phabricator.wikimedia.org/T178359) [06:29:19] (03CR) 10Marostegui: [C: 032] mariadb: Reimage db1099 [puppet] - 10https://gerrit.wikimedia.org/r/393717 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:36:47] 10Operations, 10Ops-Access-Requests, 10Gerrit: Access to logstash (LDAP group 'nda') for Paladox - https://phabricator.wikimedia.org/T181446#3791403 (10greg) 05Open>03declined Thank you, @Paladox, for all of your help with this effort (and more) with Gerrit. Unfortunately, per @faidon I'm declining this... [06:39:07] !log Stop MySQL on db1099 to copy its content to dbstore1001 and reimage it as multi-instance - T178359 [06:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:15] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [07:28:01] 10Operations, 10Ops-Access-Requests, 10Gerrit: Access to logstash (LDAP group 'nda') for Paladox - https://phabricator.wikimedia.org/T181446#3791417 (10Legoktm) Do we have alerts for Gerrit exceptions? If we could set up IRC alerts (or something else), paladox could follow those and have someone with logstas... 
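On the db1099 !log above (stop MySQL, copy its content to dbstore1001, reimage as multi-instance for T178359): before a replica is stopped and copied, the usual first step is to halt replication and record the applied coordinates so the copy can later be attached as a new replica. A rough sketch of that coordinate capture only, again assuming PyMySQL; the actual copy and reimage use dedicated tooling that is not shown in this log:

```python
import os
import pymysql  # assumption: PyMySQL; production uses dedicated transfer tooling

def freeze_and_record(host):
    """Stop replication and return the coordinates a new replica would start from."""
    conn = pymysql.connect(host=host,
                           read_default_file=os.path.expanduser('~/.my.cnf'))
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute('STOP SLAVE')
            cur.execute('SHOW SLAVE STATUS')
            s = cur.fetchone()
            if not s:
                raise RuntimeError(f'{host} is not a replica')
            # Exec_Master_Log_Pos is the position actually applied locally,
            # which is where a replica built from this copy must resume.
            return {
                'master_host': s['Master_Host'],
                'binlog_file': s['Relay_Master_Log_File'],
                'binlog_pos': s['Exec_Master_Log_Pos'],
            }
    finally:
        conn.close()
```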
[07:30:03] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [07:30:13] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0 [07:31:04] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=esamsvar-cache_type=Allvar-status_type=5 [07:31:04] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=textvar-status_type=5 [07:31:13] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 [07:32:03] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [07:33:43] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=uploadvar-status_type=5 [07:46:13] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=esamsvar-cache_type=Allvar-status_type=5 [07:46:14] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=textvar-status_type=5 [07:47:44] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=uploadvar-status_type=5 [08:02:19] !log bootstrap restbase1007-b - T179422 [08:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:29] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [08:03:33] RECOVERY - cassandra-b SSL 10.64.0.231:7001 on restbase1007 is OK: SSL OK - Certificate restbase1007-b valid until 2018-08-17 16:10:54 +0000 (expires in 262 days) [08:09:42] (03PS1) 10Marostegui: mariadb: Convert db1099 to multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/393721 (https://phabricator.wikimedia.org/T178359) [08:11:10] (03PS2) 10Marostegui: mariadb: Convert db1099 to multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/393721 (https://phabricator.wikimedia.org/T178359) [08:13:44] RECOVERY - cassandra-b CQL 10.64.0.231:9042 on restbase1007 is OK: TCP OK - 0.000 second response time on 10.64.0.231 port 9042 [08:13:52] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/9002/" [puppet] - 10https://gerrit.wikimedia.org/r/393721 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [08:16:08] (03PS1) 10Urbanecm: Define throttle rule [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/393722 (https://phabricator.wikimedia.org/T181367) [08:16:31] (03CR) 10Marostegui: [C: 032] mariadb: Convert db1099 to multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/393721 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [08:19:52] (03PS1) 10Jcrespo: mariadb: Depool db1082 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393723 (https://phabricator.wikimedia.org/T177208) [08:24:00] (03CR) 10Marostegui: [C: 031] mariadb: Depool db1082 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393723 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [08:26:46] Anybody who can add 393722 to the deployment calendar at wikitech for today EU SWAT for me? I lost access to my 2FA device [08:27:48] (03CR) 10Hashar: diamond: skip DiskSpace for Docker containers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393215 (https://phabricator.wikimedia.org/T177052) (owner: 10Hashar) [08:27:56] (03PS3) 10Hashar: diamond: skip DiskSpace for Docker containers [puppet] - 10https://gerrit.wikimedia.org/r/393215 (https://phabricator.wikimedia.org/T177052) [08:28:56] hashar maybe? [08:30:55] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1082 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393723 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [08:32:22] (03Merged) 10jenkins-bot: mariadb: Depool db1082 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393723 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [08:32:37] (03CR) 10jenkins-bot: mariadb: Depool db1082 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393723 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [08:33:35] 10Operations, 10Cloud-Services, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), and 3 others: Grafana reports ALL docker mounts in a spammy way - https://phabricator.wikimedia.org/T177052#3645705 (10hashar) [08:34:53] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1082 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393724 [08:35:02] (03CR) 10Jcrespo: [C: 04-2] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393724 (owner: 10Jcrespo) [08:37:00] I've found the change I just commited on tin, did someone rebase? [08:37:19] 10Operations, 10Release Pipeline, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), and 2 others: Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3791577 (10hashar) >>! In T178454#3789255, @Dzahn wrote: > Is https://gerrit.wi... 
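The Esams/Text/Upload HTTP 5xx alerts from 07:31–07:47 above are Graphite-backed checks: they fire when more than a given share of recent datapoints sits above a value threshold (the CRITICAL lines quote "10.00% of data above the critical threshold [1000.0]", the recoveries "Less than 1.00% above the threshold [250.0]"). A simplified sketch of that logic; the 250/1000 values come straight from the alert text, but the percentage cut-offs and the datapoint format are assumptions:

```python
def percent_over(datapoints, threshold):
    """Percentage of non-null (value, timestamp) datapoints strictly above threshold."""
    values = [v for v, _ts in datapoints if v is not None]
    if not values:
        return 0.0
    return 100.0 * sum(v > threshold for v in values) / len(values)

def check_5xx(series, warn=(250.0, 1.0), crit=(1000.0, 10.0)):
    # warn/crit are (value_threshold, max_percent_of_points); the exact
    # production parameters are an assumption based on the alert text.
    crit_pct = percent_over(series, crit[0])
    if crit_pct >= crit[1]:
        return 2, f'CRITICAL: {crit_pct:.2f}% of data above the critical threshold [{crit[0]}]'
    warn_pct = percent_over(series, warn[0])
    if warn_pct >= warn[1]:
        return 1, f'WARNING: {warn_pct:.2f}% of data above the threshold [{warn[0]}]'
    return 0, f'OK: Less than {warn[1]:.2f}% above the threshold [{warn[0]}]'
```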
[08:37:21] oh, no, it was me [08:38:38] !log jynus@tin Synchronized wmf-config/db-eqiad.php: depool db1082 (duration: 00m 44s) [08:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:01] 10Operations, 10ops-eqiad, 10Discovery, 10Maps, 10Maps-Sprint: Power supply issue on maps1002 - https://phabricator.wikimedia.org/T181477#3791578 (10Gehel) [08:39:27] ACKNOWLEDGEMENT - IPMI Sensor Status on maps1002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] Gehel Tracked in https://phabricator.wikimedia.org/T181477 [08:39:31] I got errors on mw1263 [08:40:54] I see UpdateBetaFeatureUserCountsJob::run failing to adquire lock [08:48:59] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I 've started looking into this and then realized we kind of already have it" [puppet] - 10https://gerrit.wikimedia.org/r/393696 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [08:53:55] 10Operations: Requesting access to terbium/wasat for Trey Jones - https://phabricator.wikimedia.org/T181479#3791630 (10dcausse) [08:56:21] !log ppchelko@tin Started deploy [cpjobqueue/deploy@2212086]: Move enabled jobs config to vars.yaml [08:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:49] (03CR) 10Gehel: [C: 04-1] "minor comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/392554 (https://phabricator.wikimedia.org/T181016) (owner: 10Smalyshev) [08:57:00] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@2212086]: Move enabled jobs config to vars.yaml (duration: 00m 39s) [08:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:08] (03PS1) 10Marostegui: filtered_tables: Add new columns [puppet] - 10https://gerrit.wikimedia.org/r/393725 (https://phabricator.wikimedia.org/T174569) [09:04:36] !log Drop database log from dbstore1002 - T156844 [09:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:43] T156844: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844 [09:05:25] bye bye dbstore1002, the analytics team gives you back a lot of space :) [09:06:00] (03CR) 10Gehel: [C: 04-1] "minor comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/392736 (https://phabricator.wikimedia.org/T173772) (owner: 10Smalyshev) [09:07:07] 10Operations, 10Analytics-Kanban, 10DBA, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3791689 (10Marostegui) ``` root@dbstore1002:~# mysql --skip-ssl Welcome to the MariaDB monitor. Commands end with ; or \g. Your MariaDB... 
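On the "UpdateBetaFeatureUserCountsJob::run failing to acquire lock" errors seen on mw1263 above: jobs of this kind typically guard their body with a short-lived lock so that concurrent runners skip the work rather than duplicate it, and a burst of runners produces exactly this log line. A generic sketch of the pattern, not MediaWiki's actual lock manager, using Redis SET NX EX purely as an illustrative primitive:

```python
import redis  # assumption: redis-py; MediaWiki uses its own lock/pool managers

def run_once(r, job_name, ttl=60):
    """Run the job body only if no other runner currently holds the lock."""
    lock_key = f'lock:{job_name}'
    # SET key value NX EX ttl succeeds only if the key does not already exist.
    if not r.set(lock_key, 'held', nx=True, ex=ttl):
        print(f'{job_name}::run failing to acquire lock')  # what surfaces in logs
        return False
    try:
        do_work(job_name)
        return True
    finally:
        r.delete(lock_key)

def do_work(job_name):
    """Placeholder for the actual job body (hypothetical)."""
    pass
```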
[09:10:30] (03CR) 10Marostegui: [C: 031] Delete role::mariadb::analytics [puppet] - 10https://gerrit.wikimedia.org/r/393597 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [09:10:41] (03PS2) 10Elukey: Delete role::mariadb::analytics [puppet] - 10https://gerrit.wikimedia.org/r/393597 (https://phabricator.wikimedia.org/T156844) [09:10:44] !log ppchelko@tin Started deploy [cpjobqueue/deploy@b0b1793]: Remove double-processing created for consumer group renaming [09:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:12] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@b0b1793]: Remove double-processing created for consumer group renaming (duration: 00m 28s) [09:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:23] (03CR) 10Jcrespo: "We need to check with https://phabricator.wikimedia.org/T103011#3536648" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393725 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [09:12:08] (03CR) 10Elukey: [C: 032] Delete role::mariadb::analytics [puppet] - 10https://gerrit.wikimedia.org/r/393597 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [09:14:26] (03CR) 10Elukey: [C: 032] Fix prometheus target for the Eventlogging mysql master db [puppet] - 10https://gerrit.wikimedia.org/r/393220 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [09:14:33] (03PS4) 10Elukey: Fix prometheus target for the Eventlogging mysql master db [puppet] - 10https://gerrit.wikimedia.org/r/393220 (https://phabricator.wikimedia.org/T177405) [09:15:08] !log installibg libxml-libxml-perl security updates on trusty (Debian already fixed) [09:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:14] (03CR) 10Marostegui: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393725 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [09:19:17] !log unmask and restart restbase1007-b - T179422 [09:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:24] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [09:22:37] !log restart db1082 [09:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:03] (03PS1) 10Elukey: profile::hadoop::client: move alarms definition to the worker profile [puppet] - 10https://gerrit.wikimedia.org/r/393733 (https://phabricator.wikimedia.org/T167790) [09:31:08] (03CR) 10Jcrespo: [C: 031] Revert "mariadb: Depool db1082 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393724 (owner: 10Jcrespo) [09:31:27] (03CR) 10Jcrespo: [C: 031] "This can happen now, but we will wait for buffer pool to be hot again." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393724 (owner: 10Jcrespo) [09:32:09] (03CR) 10Jcrespo: [C: 04-1] "Actually, let's also wait for the db1095 movement." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/393724 (owner: 10Jcrespo) [09:33:02] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9005/" [puppet] - 10https://gerrit.wikimedia.org/r/393733 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [09:33:46] (03PS1) 10Ppchelko: [JobQueue] Only produce wikibase-addUsagesForPage to EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393734 [09:34:04] (03PS2) 10Filippo Giunchedi: role: bump statsd inbound udp error threshold [puppet] - 10https://gerrit.wikimedia.org/r/393607 [09:34:47] (03PS2) 10Ppchelko: [JobQueue] Only produce wikibase-addUsagesForPage to EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393734 [09:35:26] !log restart db1087 for maintenance [09:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:35] (03PS3) 10Filippo Giunchedi: role: bump statsd inbound udp error threshold [puppet] - 10https://gerrit.wikimedia.org/r/393607 [09:36:01] (03CR) 10Filippo Giunchedi: [C: 032] role: bump statsd inbound udp error threshold [puppet] - 10https://gerrit.wikimedia.org/r/393607 (owner: 10Filippo Giunchedi) [09:38:37] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM, a couple of minor comments." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/392838 (owner: 10Alexandros Kosiaris) [09:39:43] (03CR) 10Giuseppe Lavagetto: [C: 031] Add kubelet_username, kubeproxy_username hieradata [puppet] - 10https://gerrit.wikimedia.org/r/392839 (owner: 10Alexandros Kosiaris) [09:40:19] (03CR) 10Giuseppe Lavagetto: [C: 031] Use kubelet/kubeproxy specific configs [puppet] - 10https://gerrit.wikimedia.org/r/392842 (owner: 10Alexandros Kosiaris) [09:47:23] (03CR) 10Alexandros Kosiaris: [C: 04-2] "After a short conversation with godog, I 've copied the interesting parts from this dashboard (disk stats that is) into the above mentione" [puppet] - 10https://gerrit.wikimedia.org/r/393696 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [09:52:54] 10Operations, 10Analytics-Kanban, 10hardware-requests: eqiad: (2) hardware request for jupyter notebook refresh (SWAP) - https://phabricator.wikimedia.org/T175603#3791731 (10faidon) a:05Ottomata>03RobH Reassigning, as it sounds like it's in @RobH's hands now (and why it likely fell through the cracks). 
[09:59:34] (03CR) 10Alexandros Kosiaris: Allow specifying kubelet/kubeproxy username/token (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/392838 (owner: 10Alexandros Kosiaris) [10:07:50] !log ppchelko@tin Started deploy [cpjobqueue/deploy@c4b9e16]: Enable wikibase-addUsagesForPage with low concurrency [10:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:19] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@c4b9e16]: Enable wikibase-addUsagesForPage with low concurrency (duration: 00m 29s) [10:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:32] (03PS1) 10Elukey: Rename ::profile::hadoop::client to commons and move some features out [puppet] - 10https://gerrit.wikimedia.org/r/393738 (https://phabricator.wikimedia.org/T167790) [10:12:58] (03CR) 10jerkins-bot: [V: 04-1] Rename ::profile::hadoop::client to commons and move some features out [puppet] - 10https://gerrit.wikimedia.org/r/393738 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [10:14:15] * elukey blames spaces [10:14:25] (03PS2) 10Elukey: Rename ::profile::hadoop::client to commons and move some features out [puppet] - 10https://gerrit.wikimedia.org/r/393738 (https://phabricator.wikimedia.org/T167790) [10:19:34] (03PS1) 10Marostegui: s5.hosts: db1099 now replicates 3318 [software] - 10https://gerrit.wikimedia.org/r/393739 (https://phabricator.wikimedia.org/T178359) [10:21:56] (03CR) 10Marostegui: [C: 032] s5.hosts: db1099 now replicates 3318 [software] - 10https://gerrit.wikimedia.org/r/393739 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:22:37] (03Merged) 10jenkins-bot: s5.hosts: db1099 now replicates 3318 [software] - 10https://gerrit.wikimedia.org/r/393739 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:33:38] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1082 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393724 (owner: 10Jcrespo) [10:33:55] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3791812 (10mobrovac) [10:36:14] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1082 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393724 (owner: 10Jcrespo) [10:36:31] (03PS3) 10Mobrovac: [JobQueue] Only produce wikibase-addUsagesForPage to EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393734 (https://phabricator.wikimedia.org/T175212) (owner: 10Ppchelko) [10:37:08] (03CR) 10Mobrovac: [C: 032] [JobQueue] Only produce wikibase-addUsagesForPage to EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393734 (https://phabricator.wikimedia.org/T175212) (owner: 10Ppchelko) [10:38:33] (03PS1) 10Jcrespo: mariadb: Depool db1044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393740 (https://phabricator.wikimedia.org/T148078) [10:38:52] (03PS2) 10Jcrespo: mariadb: Depool db1044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393740 (https://phabricator.wikimedia.org/T148078) [10:39:38] (03Merged) 10jenkins-bot: [JobQueue] Only produce wikibase-addUsagesForPage to EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393734 (https://phabricator.wikimedia.org/T175212) (owner: 10Ppchelko) [10:42:37] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393740 (https://phabricator.wikimedia.org/T148078) (owner: 10Jcrespo) 
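The cpjobqueue deploys at 10:07 above enable the wikibase-addUsagesForPage job "with low concurrency", i.e. the event-based runner keeps only a small number of these jobs in flight at once while the migration is validated. A toy sketch of what per-job-type concurrency means, using an asyncio semaphore; the real service is change-propagation (Node.js) and the numbers below are illustrative, not the values in its vars.yaml:

```python
import asyncio

# Illustrative concurrency limits only; the real values live in the
# change-propagation vars.yaml, which is not reproduced in this log.
JOB_CONCURRENCY = {
    'wikibase-addUsagesForPage': 5,
    'htmlCacheUpdate': 50,
}

async def run_job(job):
    await asyncio.sleep(0.01)  # stand-in for actually executing the job

async def consume(job_type, jobs):
    sem = asyncio.Semaphore(JOB_CONCURRENCY.get(job_type, 10))

    async def guarded(job):
        async with sem:  # never more than N jobs of this type in flight
            await run_job(job)

    await asyncio.gather(*(guarded(j) for j in jobs))

# asyncio.run(consume('wikibase-addUsagesForPage', list(range(100))))
```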
[10:43:55] (03Merged) 10jenkins-bot: mariadb: Depool db1044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393740 (https://phabricator.wikimedia.org/T148078) (owner: 10Jcrespo) [10:44:08] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Process wikibase-addUsagesForPage only via EventBus - T175212 (duration: 00m 44s) [10:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:15] T175212: Services Q2 2017/18 goal: Migrate a subset of jobs to multi-DC enabled event processing infrastructure. - https://phabricator.wikimedia.org/T175212 [10:44:45] (03PS3) 10Elukey: Rename ::profile::hadoop::client to commons and move some features out [puppet] - 10https://gerrit.wikimedia.org/r/393738 (https://phabricator.wikimedia.org/T167790) [10:44:48] (03PS1) 10Elukey: Move Hadoop common profile hiera settings to common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/393741 (https://phabricator.wikimedia.org/T167790) [10:47:07] (03PS2) 10Elukey: Move Hadoop common profile hiera settings to common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/393741 (https://phabricator.wikimedia.org/T167790) [10:49:44] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1082. Depool db1044 (duration: 00m 44s) [10:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:36] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1082 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393724 (owner: 10Jcrespo) [10:52:38] (03CR) 10jenkins-bot: [JobQueue] Only produce wikibase-addUsagesForPage to EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393734 (https://phabricator.wikimedia.org/T175212) (owner: 10Ppchelko) [10:52:40] (03CR) 10jenkins-bot: mariadb: Depool db1044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393740 (https://phabricator.wikimedia.org/T148078) (owner: 10Jcrespo) [10:55:02] (03CR) 10Ema: [C: 031] "LGTM!" [debs/pybal] - 10https://gerrit.wikimedia.org/r/393066 (https://phabricator.wikimedia.org/T180069) (owner: 10Mark Bergsma) [10:55:59] !log stop db1044 replication for db1095 master switchover [10:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:41] 10Operations, 10Performance-Team, 10Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3791910 (10Gilles) I've found the explanation [[ https://feryn.eu/blog/varnishlog-measure-varnish-cache-performance/#The_Timestamp_tag | here ]]. What's log... [11:02:17] 10Operations, 10Prod-Kubernetes, 10monitoring, 10Kubernetes, and 2 others: Gaps in kubelet-reported Prometheus metrics - https://phabricator.wikimedia.org/T181489#3791918 (10fgiunchedi) [11:06:07] (03PS3) 10Elukey: profile::hadoop::common,profile::hive::client: move hiera config in one place [puppet] - 10https://gerrit.wikimedia.org/r/393741 (https://phabricator.wikimedia.org/T167790) [11:11:02] (03PS4) 10Elukey: profile::hadoop::common,profile::hive::client: move hiera config in one place [puppet] - 10https://gerrit.wikimedia.org/r/393741 (https://phabricator.wikimedia.org/T167790) [11:25:00] !log Manually re-started Wikidata JSON dumps on snapshot1007, got stuck after db1082 went down. 
[11:25:05] apergos: FYI ^ [11:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:09] I got to go [11:25:27] automatic re-starting seems broken, I'll look into that *sigh* [11:37:07] (03PS1) 10Jcrespo: mariadb: Disable all notifications on db1044, preparing to decom [puppet] - 10https://gerrit.wikimedia.org/r/393745 (https://phabricator.wikimedia.org/T148078) [11:41:04] (03PS2) 10Jcrespo: mariadb: Disable all notifications on db1044, preparing to decom [puppet] - 10https://gerrit.wikimedia.org/r/393745 (https://phabricator.wikimedia.org/T148078) [11:41:50] (03CR) 10Jcrespo: [C: 032] mariadb: Disable all notifications on db1044, preparing to decom [puppet] - 10https://gerrit.wikimedia.org/r/393745 (https://phabricator.wikimedia.org/T148078) (owner: 10Jcrespo) [11:48:03] (03PS1) 10Jcrespo: mariadb: Decommission db1044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393747 (https://phabricator.wikimedia.org/T148078) [11:50:27] (03PS1) 10Giuseppe Lavagetto: jobrunner: drop number of "basic" jobs in favour of html ones [puppet] - 10https://gerrit.wikimedia.org/r/393748 [11:50:40] <_joe_> jynus, marostegui ^^ this will add more refreshLinks (doesn't touch the db) and htmlCacheupdate runners [11:51:01] <_joe_> I think it's conservative enough not to create issues, but I'll ping you when we deploy it [11:53:12] _joe_: that is ok to me [11:53:33] we normally do not have issues coming from the jobqueue [11:53:59] except I think sometimes they have so much local load (on client, app servers) that they can timeout on connection [11:54:03] <_joe_> I know, the jobqueue is quite good at auto-throttling [11:54:05] but that is not on our side [11:54:22] <_joe_> involuntarily so, but I digress :) [11:55:41] (03PS3) 10Giuseppe Lavagetto: profile::puppetmaster::common: Always enable environments [puppet] - 10https://gerrit.wikimedia.org/r/393677 (owner: 10Andrew Bogott) [12:01:13] 10Puppet, 10Wikimedia-Language-setup, 10Patch-For-Review, 10User-MarcoAurelio, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3792139 (10MarcoAurelio) [12:02:37] (03PS1) 10Filippo Giunchedi: graphite: blackhole spam from wanobjectcache [puppet] - 10https://gerrit.wikimedia.org/r/393749 (https://phabricator.wikimedia.org/T178531) [12:04:41] 10Operations, 10Performance-Team, 10Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3792142 (10Gilles) Slight correction: the fields start at 1, so Timestamp:Resp[2] is correct for total time spent in Varnish. We could log more, though. 
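On "automatic re-starting seems broken": the Wikidata JSON dump is expected to resume on its own after a database backend (here db1082) disappears mid-run, but this time it had to be restarted by hand on snapshot1007. A toy sketch of the retry-and-resume idea only; the real dump scripts and their checkpointing are not shown in this log, and the --resume invocation below is hypothetical:

```python
import subprocess
import time

def run_dump_with_retries(cmd, max_attempts=5, backoff=300):
    """Re-run a resumable dump command until it exits cleanly.

    Assumes the command can continue from its last checkpoint when
    re-invoked (hypothetical --resume behaviour).
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return True
        print(f'dump attempt {attempt} failed (rc={result.returncode}), '
              f'retrying in {backoff}s')
        time.sleep(backoff)
    return False

# run_dump_with_retries(['dumpwikidatajson.sh', '--resume'])  # hypothetical invocation
```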
[12:08:20] (03CR) 10Filippo Giunchedi: "PCC is happy https://puppet-compiler.wmflabs.org/compiler02/9011/graphite1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/393749 (https://phabricator.wikimedia.org/T178531) (owner: 10Filippo Giunchedi) [12:11:24] (03PS2) 10Filippo Giunchedi: graphite: blackhole spam from wanobjectcache [puppet] - 10https://gerrit.wikimedia.org/r/393749 (https://phabricator.wikimedia.org/T178531) [12:12:05] (03CR) 10Filippo Giunchedi: [C: 032] graphite: blackhole spam from wanobjectcache [puppet] - 10https://gerrit.wikimedia.org/r/393749 (https://phabricator.wikimedia.org/T178531) (owner: 10Filippo Giunchedi) [12:14:22] (03PS1) 10Gilles: Log more detailed info in Varnish slow request log [puppet] - 10https://gerrit.wikimedia.org/r/393751 (https://phabricator.wikimedia.org/T181315) [12:14:38] !log bounce carbon-frontend-relay after https://gerrit.wikimedia.org/r/393749 [12:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:22] (03CR) 10Paladox: "I am going to move this to the gerrit repo where we do scap. As we can do this as a plugin now." [puppet] - 10https://gerrit.wikimedia.org/r/368547 (owner: 10Paladox) [12:16:49] (03PS1) 10Filippo Giunchedi: graphite: fix wanobjectcache regex [puppet] - 10https://gerrit.wikimedia.org/r/393752 [12:18:02] (03PS2) 10Filippo Giunchedi: graphite: fix wanobjectcache regex [puppet] - 10https://gerrit.wikimedia.org/r/393752 [12:18:41] (03CR) 10Filippo Giunchedi: [C: 032] graphite: fix wanobjectcache regex [puppet] - 10https://gerrit.wikimedia.org/r/393752 (owner: 10Filippo Giunchedi) [12:20:00] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3792183 (10Gilles) Going back to the context of this task, the request that took more than a minute in the HAR should have been seen by... 
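For the load.php slow-request investigation: varnishlog "Timestamp" records have the form "Timestamp: <label>: <absolute epoch> <seconds since request start> <seconds since last timestamp>", so, as noted above, the second numeric field of Timestamp:Resp (1-indexed) is the total time the request spent in Varnish. A small sketch of pulling that out of a logged line; the surrounding layout written by the slow-request-log VCL is an assumption, only the Timestamp record format itself comes from the varnishlog documentation:

```python
def parse_timestamp(line):
    """Split a varnishlog 'Timestamp' record into its label and three float fields."""
    _prefix, _, rest = line.partition('Timestamp: ')
    label, _, numbers = rest.partition(': ')
    absolute, since_start, since_last = (float(x) for x in numbers.split())
    return label, absolute, since_start, since_last

# Illustrative numbers, not taken from the actual incident:
label, absolute, since_start, since_last = parse_timestamp(
    '- Timestamp: Resp: 1511877600.123456 160.000000 0.000031')
assert label == 'Resp'
# since_start (~160s here) is the total time spent in Varnish, i.e. what the
# comments above refer to as Timestamp:Resp[2].
```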
[12:25:55] !log cleanup wanobjectcache metrics with hashes - T178531 [12:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:06] T178531: Add statsd metric to WANObjectCache - https://phabricator.wikimedia.org/T178531 [12:33:17] (03Draft1) 10Paladox: Gerrit: Add wmf branding to PolyGerrit [software/gerrit] - 10https://gerrit.wikimedia.org/r/393753 [12:33:19] (03PS2) 10Paladox: Gerrit: Add wmf branding to PolyGerrit [software/gerrit] - 10https://gerrit.wikimedia.org/r/393753 [12:36:42] (03PS1) 10Jcrespo: mariadb: Update s5-master and add s8-master CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/393755 (https://phabricator.wikimedia.org/T177208) [12:42:55] (03PS3) 10Paladox: Gerrit: Add wmf branding to PolyGerrit [software/gerrit] - 10https://gerrit.wikimedia.org/r/393753 [12:47:06] (03PS4) 10Elukey: Rename ::profile::hadoop::client to commons and move some features out [puppet] - 10https://gerrit.wikimedia.org/r/393738 (https://phabricator.wikimedia.org/T167790) [12:47:08] (03PS5) 10Elukey: profile::hadoop::common,profile::hive::client: move hiera config in one place [puppet] - 10https://gerrit.wikimedia.org/r/393741 (https://phabricator.wikimedia.org/T167790) [12:47:10] (03PS1) 10Elukey: profile::hadoop::common: import hiera config from cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790) [12:47:48] (03CR) 10jerkins-bot: [V: 04-1] profile::hadoop::common: import hiera config from cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [12:49:38] (03PS2) 10Elukey: profile::hadoop::common: import hiera config from cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790) [12:52:51] (03CR) 10Marostegui: [C: 031] mariadb: Update s5-master and add s8-master CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/393755 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [12:53:19] (03PS3) 10Elukey: profile::hadoop::common: import hiera config from cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790) [13:02:31] (03PS4) 10Elukey: profile::hadoop::common: import hiera config from cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790) [13:03:41] (03CR) 10Elukey: [C: 031] jobrunner: drop number of "basic" jobs in favour of html ones [puppet] - 10https://gerrit.wikimedia.org/r/393748 (owner: 10Giuseppe Lavagetto) [13:07:59] (03CR) 10Marostegui: "I would wait a few days to make sure db1072 works fine before fully decommissioning" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393747 (https://phabricator.wikimedia.org/T148078) (owner: 10Jcrespo) [13:08:14] (03PS5) 10Elukey: profile::hadoop::common: import hiera config from cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790) [13:12:00] (03PS1) 10KartikMistry: apertium-tur: New upstream release [debs/contenttranslation/apertium-tur] - 10https://gerrit.wikimedia.org/r/393758 (https://phabricator.wikimedia.org/T181465) [13:22:14] (03PS6) 10Elukey: profile::hadoop::common: import hiera config from cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790) [13:39:20] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target 
language wiki.) timed out before a response was received: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed o [13:39:21] e was received [13:39:26] (03CR) 10Elukey: "Cumulative pcc seems to be a no-op: https://puppet-compiler.wmflabs.org/compiler02/9019/" [puppet] - 10https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [13:40:11] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy [13:44:40] (03PS3) 10Elukey: Kafka: Enable topic deletion for Kafka by default [puppet] - 10https://gerrit.wikimedia.org/r/349280 (https://phabricator.wikimedia.org/T163392) (owner: 10Ppchelko) [13:48:50] (03PS1) 10Jon Harald Søby: Add category collation for sewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393762 (https://phabricator.wikimedia.org/T181503) [13:49:26] (03CR) 10Elukey: [C: 032] Kafka: Enable topic deletion for Kafka by default [puppet] - 10https://gerrit.wikimedia.org/r/349280 (https://phabricator.wikimedia.org/T163392) (owner: 10Ppchelko) [14:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171128T1400). [14:00:04] No GERRIT patches in the queue for this window AFAICS. [14:00:45] \o/ [14:03:11] !log reboot kafka200[123] for kernel + jvm updates - T179943 [14:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:18] T179943: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943 [14:07:26] (03CR) 10Filippo Giunchedi: Add postgresql::prometheus class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/392438 (https://phabricator.wikimedia.org/T177196) (owner: 10Alexandros Kosiaris) [14:07:53] (03PS2) 10Filippo Giunchedi: redis: use hostname not fqdn in redis_exporter [puppet] - 10https://gerrit.wikimedia.org/r/393605 (https://phabricator.wikimedia.org/T148637) [14:08:46] (03CR) 10Filippo Giunchedi: [C: 032] redis: use hostname not fqdn in redis_exporter [puppet] - 10https://gerrit.wikimedia.org/r/393605 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [14:14:17] (03PS4) 10Paladox: Gerrit: Add wmf branding to PolyGerrit [software/gerrit] - 10https://gerrit.wikimedia.org/r/393753 [14:14:26] (03CR) 10Paladox: "This is what it looks like https://phabricator.wikimedia.org/F11050246 :)" [software/gerrit] - 10https://gerrit.wikimedia.org/r/393753 (owner: 10Paladox) [14:14:31] zeljkof: I am going to deploy db-eqiad.php if there are no swat today? [14:14:35] *there is [14:14:59] (03CR) 10Filippo Giunchedi: [C: 031] Restrict access to ferm service on mwlog* hosts [puppet] - 10https://gerrit.wikimedia.org/r/393240 (owner: 10Muehlenhoff) [14:15:03] marostegui: no swat as far as I know, go ahead [14:15:15] (03CR) 10Paladox: "and for mobile https://phabricator.wikimedia.org/F11050258" [software/gerrit] - 10https://gerrit.wikimedia.org/r/393753 (owner: 10Paladox) [14:16:51] (03PS1) 10Marostegui: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393768 (https://phabricator.wikimedia.org/T178359) [14:16:55] zeljkof: thanks! [14:17:27] (03CR) 10Ottomata: [C: 031] "+1!" 
[puppet] - 10https://gerrit.wikimedia.org/r/393738 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [14:17:34] !log reboot kafka10[12-22] for kernel + jvm updates - T179943 [14:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:43] T179943: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943 [14:18:41] (03CR) 10Ottomata: [C: 031] "Yay for DRY!" [puppet] - 10https://gerrit.wikimedia.org/r/393741 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [14:18:52] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393768 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [14:20:15] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393768 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [14:20:26] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393768 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [14:21:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1055 - T178359 (duration: 00m 44s) [14:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:26] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [14:21:27] (03CR) 10Jcrespo: "To be fair, I have not touched db1072, and db1044 has been depooled for long- if something could break, that would be db1095." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393747 (https://phabricator.wikimedia.org/T148078) (owner: 10Jcrespo) [14:26:17] (03CR) 10Imarlier: "Looks to me like this will do what we want -- awesome." [puppet] - 10https://gerrit.wikimedia.org/r/393751 (https://phabricator.wikimedia.org/T181315) (owner: 10Gilles) [14:28:38] (03CR) 10Marostegui: "> To be fair, I have not touched db1072, and db1044 has been depooled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393747 (https://phabricator.wikimedia.org/T148078) (owner: 10Jcrespo) [14:28:45] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Check analytics1037 power supply status - https://phabricator.wikimedia.org/T179192#3792547 (10Cmjohnson) On the server, the LED indicator for the power supply is green and not showing any signs of a problem. I even removed power from each side to ensu... [14:30:00] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table dewiki.pagelinks: Cant find record in pagelinks, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1070-bin.001534, end_log_pos 453556318 [14:33:30] ^I am taking care of that [14:34:42] !log gehel@tin Started deploy [kartotherian/deploy@55c5da4]: (no justification provided) [14:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:53] !log gehel@tin Finished deploy [kartotherian/deploy@55c5da4]: (no justification provided) (duration: 00m 11s) [14:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:55] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T181028#3792589 (10Cmjohnson) A ticket has been created with HP support. They will ask for a report sometime today and hopefully ship it out today. 
Case ID: 5324992816 [14:37:16] (03CR) 10Jon Harald Søby: "After this is merged, the deployer should run the following maintenance script:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393762 (https://phabricator.wikimedia.org/T181503) (owner: 10Jon Harald Søby) [14:37:31] (03CR) 10Cmjohnson: [C: 032] adding mgmt dns entries for db111[12] T180788 [dns] - 10https://gerrit.wikimedia.org/r/392702 (owner: 10Cmjohnson) [14:37:52] !log Stop MySQL on db1055 to clone db1099:3311 - T178359 [14:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:59] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [14:39:10] !log gehel@tin Started deploy [kartotherian/deploy@cb9b1ef]: testing new kartotherian packaging on maps-test2003 [14:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:23] (03CR) 10Giuseppe Lavagetto: [C: 031] profile: add redis_exporter to redis multidc [puppet] - 10https://gerrit.wikimedia.org/r/393606 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [14:39:28] !log gehel@tin Finished deploy [kartotherian/deploy@cb9b1ef]: testing new kartotherian packaging on maps-test2003 (duration: 00m 18s) [14:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:34] (03PS2) 10Herron: puppet: point codfw and canary cp servers at codfw puppet 4 masters [puppet] - 10https://gerrit.wikimedia.org/r/392676 (https://phabricator.wikimedia.org/T177254) [14:40:10] 10Operations, 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Package openldap collector for Prometheus and adapt metrics - https://phabricator.wikimedia.org/T181511#3792600 (10MoritzMuehlenhoff) [14:40:47] !log otto@tin Started deploy [eventlogging/analytics@c464b8c]: Fixing bug where userAgent set by client producer was not used [14:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:54] !log otto@tin Started deploy [eventlogging/analytics@c464b8c]: Fixing bug where userAgent set by client producer was not used T178440 [14:40:56] !log otto@tin Finished deploy [eventlogging/analytics@c464b8c]: Fixing bug where userAgent set by client producer was not used T178440 (duration: 00m 02s) [14:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:00] T178440: Refine should parse user agent field as it is done on refinery pipeline - https://phabricator.wikimedia.org/T178440 [14:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:40] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 873.24 seconds [14:41:44] (03PS2) 10Ema: Log more detailed info in Varnish slow request log [puppet] - 10https://gerrit.wikimedia.org/r/393751 (https://phabricator.wikimedia.org/T181315) (owner: 10Gilles) [14:42:12] (03CR) 10Ema: [V: 032 C: 032] Log more detailed info in Varnish slow request log [puppet] - 10https://gerrit.wikimedia.org/r/393751 (https://phabricator.wikimedia.org/T181315) (owner: 10Gilles) [14:44:01] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:45:49] !log gehel@tin Started deploy [kartotherian/deploy@cb9b1ef]: testing new kartotherian packaging on maps-test2003 [14:45:52] !log gehel@tin Finished deploy [kartotherian/deploy@cb9b1ef]: testing new kartotherian packaging on maps-test2003 (duration: 00m 02s) [14:45:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:57] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/9022/" [puppet] - 10https://gerrit.wikimedia.org/r/392676 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [14:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:40] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 141.84 seconds [14:47:50] (03Abandoned) 10Paladox: Gerrit: Add wmf branding to PolyGerrit [puppet] - 10https://gerrit.wikimedia.org/r/368547 (owner: 10Paladox) [14:50:22] (03PS2) 10Jcrespo: mariadb: Update s5-master and add s8-master CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/393755 (https://phabricator.wikimedia.org/T177208) [14:50:37] (03CR) 10Jcrespo: [C: 032] mariadb: Update s5-master and add s8-master CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/393755 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [14:51:47] Can I do a deployment for SWAT? [14:51:55] I know it's very late [14:52:00] zeljkof: ^ [14:52:29] Amir1: you can as far as I am concerned :) [14:52:46] Awesome, I'm pretty it\s very fast [14:52:54] marostegui: done with deployment? Amir1 wants to deploy something [14:53:09] yeah! [14:53:29] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393679 (https://phabricator.wikimedia.org/T180450) (owner: 10Ladsgroup) [14:53:41] (03PS1) 10Cmjohnson: Removing mentions of decom server strontium [puppet] - 10https://gerrit.wikimedia.org/r/393771 [14:53:41] (03PS1) 10Cmjohnson: Merge branch 'production' of https://gerrit.wikimedia.org/r/p/operations/puppet into production [puppet] - 10https://gerrit.wikimedia.org/r/393772 [14:54:05] (03CR) 10jerkins-bot: [V: 04-1] Merge branch 'production' of https://gerrit.wikimedia.org/r/p/operations/puppet into production [puppet] - 10https://gerrit.wikimedia.org/r/393772 (owner: 10Cmjohnson) [14:54:28] 10Operations, 10DBA, 10Patch-For-Review: Decommission db1015, db1035, db1044 and db1038 - https://phabricator.wikimedia.org/T148078#3792646 (10jcrespo) [14:54:30] (03Abandoned) 10Cmjohnson: Removing mentions of decom server strontium [puppet] - 10https://gerrit.wikimedia.org/r/393771 (owner: 10Cmjohnson) [14:54:37] (03Abandoned) 10Cmjohnson: Merge branch 'production' of https://gerrit.wikimedia.org/r/p/operations/puppet into production [puppet] - 10https://gerrit.wikimedia.org/r/393772 (owner: 10Cmjohnson) [14:54:42] 10Operations, 10DBA, 10Patch-For-Review: Decommission db1015, db1035, db1044 and db1038 - https://phabricator.wikimedia.org/T148078#2714228 (10jcrespo) a:03Marostegui [14:55:35] (03PS2) 10Ladsgroup: Revert "Comply wikidata with new ores thresholds" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393679 (https://phabricator.wikimedia.org/T180450) [14:55:47] (03CR) 10Ladsgroup: [C: 032] Revert "Comply wikidata with new ores thresholds" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393679 (https://phabricator.wikimedia.org/T180450) (owner: 10Ladsgroup) [14:56:18] !log Deploy schema change on s3 on dbstore1002 - T174569 [14:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:25] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [14:56:28] (03PS1) 10Cmjohnson: Adding dhcpd entries for mw1329-1337 [puppet] - 10https://gerrit.wikimedia.org/r/393773 [14:56:49] (03PS3) 10Filippo Giunchedi: profile: add redis_exporter to redis multidc [puppet] - 
10https://gerrit.wikimedia.org/r/393606 (https://phabricator.wikimedia.org/T148637) [14:57:05] (03Merged) 10jenkins-bot: Revert "Comply wikidata with new ores thresholds" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393679 (https://phabricator.wikimedia.org/T180450) (owner: 10Ladsgroup) [14:57:14] (03CR) 10jenkins-bot: Revert "Comply wikidata with new ores thresholds" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393679 (https://phabricator.wikimedia.org/T180450) (owner: 10Ladsgroup) [14:57:36] Amir1: cool! [14:58:25] (03CR) 10Filippo Giunchedi: [C: 032] profile: add redis_exporter to redis multidc [puppet] - 10https://gerrit.wikimedia.org/r/393606 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [14:59:26] (03PS1) 10Cmjohnson: mend [puppet] - 10https://gerrit.wikimedia.org/r/393774 [15:01:04] (03PS2) 10Cmjohnson: Adding dhcpd entries for mw1329-1337 [puppet] - 10https://gerrit.wikimedia.org/r/393773 [15:02:51] (03CR) 10Cmjohnson: [C: 032] Adding dhcpd entries for mw1329-1337 [puppet] - 10https://gerrit.wikimedia.org/r/393773 (owner: 10Cmjohnson) [15:04:11] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: Revert "Comply wikidata with new ores thresholds" (duration: 00m 45s) [15:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:22] awight: okay, The change disables the thresholds [15:05:37] not all but for two or three cases [15:06:10] Amir1: Ah I was thinking this was something to compensate for the new wikidat editquality model [15:06:40] !log Compress s5 on db1099 [15:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:04] This fixes some problems but not in a robust way [15:07:09] we need to find a solution [15:07:19] 10Operations, 10puppet-compiler: puppet compiler fail compilation on manifests using puppetdb - https://phabricator.wikimedia.org/T180671#3792664 (10Joe) Turns out I was correct; I went the nuclear option and I'm currently rebuilding the db on both hosts - it was broken everywhere. I guess we can kind of expec... 
[15:08:37] (03PS2) 10Giuseppe Lavagetto: jobrunner: drop number of "basic" jobs in favour of html ones [puppet] - 10https://gerrit.wikimedia.org/r/393748 [15:08:46] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: drop number of "basic" jobs in favour of html ones [puppet] - 10https://gerrit.wikimedia.org/r/393748 (owner: 10Giuseppe Lavagetto) [15:16:11] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29fullscreenorgId=1 [15:16:43] this is me --^ [15:16:51] too late adding the downtime [15:17:01] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29fullscreenorgId=1 [15:17:50] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29fullscreenorgId=1 [15:24:20] (03CR) 10Anomie: filtered_tables: Add new columns (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393725 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [15:24:46] (03PS3) 10Herron: puppet: point codfw and canary cp servers at codfw puppet 4 masters [puppet] - 10https://gerrit.wikimedia.org/r/392676 (https://phabricator.wikimedia.org/T177254) [15:25:11] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 65.22% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29fullscreenorgId=1 [15:25:30] PROBLEM - IPsec on cp4030 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [15:25:30] PROBLEM - IPsec on cp4032 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [15:25:30] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [15:25:30] PROBLEM - IPsec on cp4028 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [15:25:30] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [15:25:30] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [15:25:31] PROBLEM - IPsec on cp4029 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [15:25:31] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [15:25:32] PROBLEM - IPsec on cp4027 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [15:25:32] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [15:25:33] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [15:25:33] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [15:25:50] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [15:25:50] PROBLEM - IPsec on cp4031 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [15:25:50] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [15:25:51] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [15:25:51] 
PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [15:25:51] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 24 not-conn: kafka1018_v4,kafka1018_v6 [15:25:51] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [15:25:51] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [15:25:52] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1018_v4,kafka1018_v6 [15:25:52] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [15:25:53] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [15:25:53] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [15:26:03] ahhh I was missing the Ipsec shower [15:26:06] sorry people [15:26:10] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1018_v4,kafka1018_v6 [15:26:10] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [15:26:11] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1018_v4,kafka1018_v6 [15:26:11] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 24 not-conn: kafka1018_v4,kafka1018_v6 [15:26:11] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1018_v4,kafka1018_v6 [15:26:11] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1018_v4,kafka1018_v6 [15:26:16] (03PS4) 10Herron: puppet: point codfw misc and canary cp hosts at codfw puppet4 masters [puppet] - 10https://gerrit.wikimedia.org/r/392676 (https://phabricator.wikimedia.org/T177254) [15:26:20] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 24 not-conn: kafka1018_v4,kafka1018_v6 [15:26:20] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [15:26:20] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [15:26:21] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [15:27:21] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29fullscreenorgId=1 [15:28:12] kafka1018 doesn't like rebooting [15:29:40] PROBLEM - Check whether ferm is active by checking the default input chain on db1110 is CRITICAL: Return code of 255 is out of bounds [15:29:50] PROBLEM - Disk space on db1110 is CRITICAL: Return code of 255 is out of bounds [15:29:53] (03CR) 10Jcrespo: "Then we should add them, and check if the script white lists or black lists. Probably merge with the private_tables on realm.txt in the fu" [puppet] - 10https://gerrit.wikimedia.org/r/393725 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [15:30:00] PROBLEM - dhclient process on db1110 is CRITICAL: Return code of 255 is out of bounds [15:30:06] PROBLEM - mysqld processes on db1110 is CRITICAL: Return code of 255 is out of bounds [15:30:11] PROBLEM - MariaDB Slave IO: s5 on db1110 is CRITICAL: Return code of 255 is out of bounds [15:30:20] <_joe_> uhm [15:30:20] what's up? [15:30:30] that doesn't look good [15:30:30] <_joe_> this might have to do with my change? 
[15:30:42] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1956 bytes in 0.087 second response time [15:30:53] RECOVERY - Check whether ferm is active by checking the default input chain on db1110 is OK: OK ferm input default policy is set [15:30:58] is db1110 independent of the kafka1018/ipsec issue? [15:31:02] RECOVERY - Disk space on db1110 is OK: DISK OK [15:31:04] is that that is the largest s5 host [15:31:05] <_joe_> jynus, marostegui I was about to tell you the change to the jobrunners should be online [15:31:13] RECOVERY - dhclient process on db1110 is OK: PROCS OK: 0 processes with command name dhclient [15:31:18] RECOVERY - mysqld processes on db1110 is OK: PROCS OK: 1 process with command name mysqld [15:31:23] RECOVERY - MariaDB Slave IO: s5 on db1110 is OK: OK slave_io_state Slave_IO_Running: Yes [15:31:31] did you merge firewall changes? [15:31:37] <_joe_> jynus: nope [15:31:43] <_joe_> just changed the number of runners [15:31:45] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1panelId=2fullscreen [15:31:48] then it is not that [15:32:26] Waiting in MASTER_GTID_WAIT() [15:32:29] it is overloaded [15:32:56] <_joe_> should I revert? that's the only thing that changed in the last few minutes [15:33:01] is there something I can do to help, or we already know what's at the bottom of this? [15:33:18] I have no idea what is happening [15:33:21] mysql is up [15:33:27] with thousands of connections [15:33:35] the other thing ongoing is kafka1018 that is not booting up, but except from spamming icinga it shouldn't be related at all [15:33:43] jynus: where? I can see only 253 [15:33:53] on db1110 [15:34:01] (03PS1) 10Giuseppe Lavagetto: Revert "jobrunner: drop number of "basic" jobs in favour of html ones" [puppet] - 10https://gerrit.wikimedia.org/r/393779 [15:34:09] <_joe_> coming from where? [15:34:13] volans: obviously it wasn't like that a few minutes ago [15:34:39] <_joe_> jynus: if you say the word, I will rollback. Is that server still overloaded? [15:35:11] the MW fatals are all DB-related of course, just confirming [15:35:21] /includes/libs/rdbms/lbfactory/LBFactory.php: Could not wait for replica DBs to catch up to db1070 [15:35:27] database query error has occurred. Did you forget to run your application's database schema upda [15:35:30] etc, etc... [15:35:36] Cannot access the database: Unknown error (10.64.32.31) [15:35:40] that is the default error [15:35:45] it wen't down again [15:36:15] yeah that spike seems to have been brief (the exceptions/fatals) [15:36:27] cpu usage went down at 15:23 apparently [15:36:35] <_joe_> wtf? [15:36:36] <_joe_> dig -x 10.64.32.31 [15:36:36] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=db1110&var-datasource=eqiad%20prometheus%2Fops&from=now-1h&to=now [15:36:41] <_joe_> 31.32.64.10.in-addr.arpa. 2396 IN PTR mw1329.eqiad.wmnet. [15:36:41] <_joe_> 31.32.64.10.in-addr.arpa. 2396 IN PTR db1110.eqiad.wmnet. [15:36:47] <_joe_> wat? 
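The dig output above is the root cause of this stretch of the incident: 10.64.32.31 carries PTR records for both mw1329 and db1110, so traffic meant for the database can land on a freshly imaged appserver. A conflict like this can be caught before merge with a small scan of the reverse zone for octets owned by more than one host. The sketch below is a minimal version; the zone file path and the "<octet> ... IN PTR <fqdn>" layout it assumes are placeholders, not the actual operations/dns repository format.

```python
#!/usr/bin/env python3
"""Flag reverse-zone entries whose IP is mapped to more than one hostname.

Minimal sketch of a pre-merge sanity check for the kind of duplicate PTR
assignment seen above (10.64.32.31 resolving to both mw1329 and db1110).
The default file path and the record layout are assumptions.
"""
import collections
import re
import sys

PTR_RE = re.compile(r'^(\S+)\s+.*\bIN\s+PTR\s+(\S+)', re.IGNORECASE)

def find_duplicates(zone_path):
    owners = collections.defaultdict(set)
    with open(zone_path) as zone:
        for line in zone:
            line = line.split(';', 1)[0].strip()   # drop zone-file comments
            match = PTR_RE.match(line)
            if match:
                octet, fqdn = match.groups()
                owners[octet].add(fqdn.rstrip('.'))
    return {octet: fqdns for octet, fqdns in owners.items() if len(fqdns) > 1}

if __name__ == '__main__':
    # Hypothetical default path; pass the real reverse zone file as an argument.
    path = sys.argv[1] if len(sys.argv) > 1 else 'templates/10.in-addr.arpa'
    dupes = find_duplicates(path)
    for octet, fqdns in sorted(dupes.items()):
        print('duplicate PTR owner %s -> %s' % (octet, ', '.join(sorted(fqdns))))
    sys.exit(1 if dupes else 0)
```

Running dig -x against the live resolver (as done above) is the quickest check after the fact; a scan like this would flag the conflict at review time instead.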
[15:36:50] it seems now it asks for a password to log in, or someoone did a man in the middle [15:36:56] yeah that could be a problem [15:36:57] <_joe_> jynus: see above [15:37:03] I am depooling [15:37:07] from mediawiki [15:37:10] PROBLEM - MariaDB disk space on db1110 is CRITICAL: Return code of 255 is out of bounds [15:37:14] PROBLEM - Check whether ferm is active by checking the default input chain on db1110 is CRITICAL: Return code of 255 is out of bounds [15:37:19] <_joe_> jynus: yeah gonna fix this now [15:37:24] PROBLEM - Disk space on db1110 is CRITICAL: Return code of 255 is out of bounds [15:37:34] PROBLEM - dhclient process on db1110 is CRITICAL: Return code of 255 is out of bounds [15:37:39] PROBLEM - mysqld processes on db1110 is CRITICAL: Return code of 255 is out of bounds [15:37:40] those two hosts have identical IPs in DNS [15:37:45] PROBLEM - MariaDB Slave IO: s5 on db1110 is CRITICAL: Return code of 255 is out of bounds [15:37:45] :( [15:37:46] I'd say poweroff mw1329 pronto [15:37:49] <_joe_> can someone kill mw1329? [15:37:50] +! [15:38:22] RECOVERY - MariaDB disk space on db1110 is OK: DISK OK [15:38:26] RECOVERY - Check whether ferm is active by checking the default input chain on db1110 is OK: OK ferm input default policy is set [15:38:27] !log powered off mw1329 [15:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:34] it doesn't help [15:38:35] it is a new host [15:38:36] RECOVERY - Disk space on db1110 is OK: DISK OK [15:38:37] https://gerrit.wikimedia.org/r/#/c/393773/ [15:38:42] that if one slave goes down [15:38:43] give it a sec [15:38:49] the whole wiki goes down [15:38:51] RECOVERY - mysqld processes on db1110 is OK: PROCS OK: 1 process with command name mysqld [15:38:52] RECOVERY - dhclient process on db1110 is OK: PROCS OK: 0 processes with command name dhclient [15:38:57] RECOVERY - MariaDB Slave IO: s5 on db1110 is OK: OK slave_io_state Slave_IO_Running: Yes [15:38:58] <_joe_> of course it's back [15:39:02] !log jynus@tin Synchronized wmf-config/db-eqiad.php: depool db1110 (duration: 00m 44s) [15:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:17] _joe_ mw1329 was part of https://gerrit.wikimedia.org/r/#/c/393773/2/modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200 [15:39:22] mediawiki deploy is more important [15:39:30] the problem is secondary [15:39:35] (solving it) [15:39:44] <_joe_> jynus: no, we actually already solved it [15:39:54] (03PS1) 10Cmjohnson: Adding production dns for db1111/2 [dns] - 10https://gerrit.wikimedia.org/r/393781 [15:40:03] https://gerrit.wikimedia.org/r/#/c/392650/1/templates/10.in-addr.arpa [15:40:14] cmjohnson1: please don't merge [15:40:35] (03CR) 10Elukey: "Please don't merge this now" [dns] - 10https://gerrit.wikimedia.org/r/393781 (owner: 10Cmjohnson) [15:40:39] okay [15:40:44] ah ok you are here :) [15:41:13] let's move mw1329 to .66 [15:41:16] <_joe_> elukey: are those servers in production in any way? 
[15:41:19] <_joe_> volans: stop [15:41:26] _joe_ new appservers, last batch, nope [15:41:40] <_joe_> https://gerrit.wikimedia.org/r/#/c/392650/1 [15:41:43] <_joe_> it's weird [15:41:52] <_joe_> it says 61-69 in the commit message [15:42:05] "This is the update for mw1329-37 not 1361-69" [15:42:12] query problems fixed: https://logstash.wikimedia.org/goto/e303956866cd2bce403481f04327b8bb [15:42:13] <_joe_> meh [15:42:18] <_joe_> elukey: it's safe to revert then [15:42:22] I will now deploy the depool apropiately [15:42:28] shit sorry I see what I did there [15:42:39] (03PS1) 10Giuseppe Lavagetto: Revert "Adding dns entries for new mw hosts mw13[61-69] T165519" [dns] - 10https://gerrit.wikimedia.org/r/393782 [15:42:44] (03PS2) 10Giuseppe Lavagetto: Revert "Adding dns entries for new mw hosts mw13[61-69] T165519" [dns] - 10https://gerrit.wikimedia.org/r/393782 [15:42:59] <_joe_> anyone care to give me a +1? [15:43:27] (03CR) 10Volans: [C: 031] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/393782 (owner: 10Giuseppe Lavagetto) [15:43:38] (03CR) 10Elukey: [C: 031] Revert "Adding dns entries for new mw hosts mw13[61-69] T165519" [dns] - 10https://gerrit.wikimedia.org/r/393782 (owner: 10Giuseppe Lavagetto) [15:44:14] (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "Adding dns entries for new mw hosts mw13[61-69] T165519" [dns] - 10https://gerrit.wikimedia.org/r/393782 (owner: 10Giuseppe Lavagetto) [15:44:19] task is https://phabricator.wikimedia.org/T165519 [15:44:52] so once the dhcp was merged, overlapping ips ? [15:45:19] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1panelId=2fullscreen [15:45:48] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1934 bytes in 0.095 second response time [15:47:21] <_joe_> elukey: once someone started the server [15:47:27] <_joe_> it does pxe => dhcp [15:47:30] ah yes sure [15:48:07] in the meantime, kafka1018 has horrible errors in the getsel [15:48:08] so nice [15:48:42] jynus, marostegui: For my updating of [[m:Community_Tech/Edit_summary_length_for_non-Latin_languages]], do you have a rough estimate for when the schema change might be done for s3 (or for testwiki, test2wiki, and mediawikiwiki in particular)? [15:49:01] anomie: I will try to get it done by this week and next week for s3 only [15:49:25] anomie: we only have eqiad hosts pending (and there are not many of them) so hopefully by next week [15:49:33] marostegui: Thanks. [15:50:21] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Watching / External): Update Debian package for Blubber - https://phabricator.wikimedia.org/T179984#3792786 (10thcipriani) 05Open>03Resolved [15:51:08] anomie: I gave you one for the 100% [15:51:17] on my comment, other shards are not blocked [15:51:49] and we can prioritize group0 ones first [15:52:53] (03PS1) 10Filippo Giunchedi: role: use scalar() to check statsd udp loss [puppet] - 10https://gerrit.wikimedia.org/r/393784 (https://phabricator.wikimedia.org/T181382) [15:55:07] marostegui, jynus: Updated at https://meta.wikimedia.org/wiki/Community_Tech/Edit_summary_length_for_non-Latin_languages#November_28.2C_2017. Feel free to edit if you'd like. [15:55:20] anomie: thanks [15:55:21] cmjohnson1: not sure how busy you are, but if you are in the dc today would you mind to check kafka1018? 
After the reboot it seems dead and powercycle/hardreset do not really work.. getsel is full of horrible things :D [15:55:33] I am going to open a phab task [15:55:53] (03CR) 10Filippo Giunchedi: [C: 032] role: use scalar() to check statsd udp loss [puppet] - 10https://gerrit.wikimedia.org/r/393784 (https://phabricator.wikimedia.org/T181382) (owner: 10Filippo Giunchedi) [15:56:11] anomie: thanks - mid january for all the shards is not realistic. I will change that to mid Feb for now [15:56:21] We have christmas in between and big wikis take almost one day per server [15:56:54] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#3792806 (10awight) [15:58:42] marostegui: Ok. https://phabricator.wikimedia.org/T166733#3792679 said mid-January, which is where I got that from. [15:58:58] anomie: that was my faulty [15:59:06] I think that means mid-january the blockage [15:59:11] realisticly as manual says [15:59:19] s5 will take more to deploy [15:59:47] anomie: apologies I was unclear [16:00:16] marostegui: is is 13 January where we will get unblocked, right? [16:00:24] elukey: sur [16:00:26] sure [16:00:40] jynus: you mean the failover? I proposed the 9th [16:00:45] ok, so 9 [16:00:53] plus the time to actually do the schema change [16:00:57] indeed [16:01:14] anomie: and as you can see, new outages and issues happen that delay things [16:01:32] so february is more realistic, sorry [16:02:03] I am blocking tin deploys [16:02:30] ping me if you are doing somthing until I upload a new patch [16:02:33] (on tin) [16:02:43] Don't need tin right now, so no worries [16:02:51] I will ping you if I need it (I might later) [16:04:13] 10Operations, 10ops-eqiad, 10Analytics-Kanban: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3792843 (10elukey) [16:04:30] cmjohnson1: thanks! opened --^ [16:04:56] 10Operations, 10Mail: tls expiry check for mx vs acme-setup renewal period - https://phabricator.wikimedia.org/T181519#3792856 (10fgiunchedi) [16:08:36] (03PS1) 10BBlack: [Corrected] Adding dns entries for new mw hosts mw13[29-37] T165519 [dns] - 10https://gerrit.wikimedia.org/r/393787 (https://phabricator.wikimedia.org/T165519) [16:09:48] (03PS1) 10Jcrespo: mariadb: Depool db1110 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393789 (https://phabricator.wikimedia.org/T165519) [16:12:01] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1110 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393789 (https://phabricator.wikimedia.org/T165519) (owner: 10Jcrespo) [16:12:37] (03CR) 10Volans: [C: 031] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/393787 (https://phabricator.wikimedia.org/T165519) (owner: 10BBlack) [16:13:13] elukey: you broke it :-P [16:13:39] (03Merged) 10jenkins-bot: mariadb: Depool db1110 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393789 (https://phabricator.wikimedia.org/T165519) (owner: 10Jcrespo) [16:13:41] (03CR) 10BBlack: [C: 032] [Corrected] Adding dns entries for new mw hosts mw13[29-37] T165519 [dns] - 10https://gerrit.wikimedia.org/r/393787 (https://phabricator.wikimedia.org/T165519) (owner: 10BBlack) [16:13:57] (03CR) 10jenkins-bot: mariadb: Depool db1110 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393789 (https://phabricator.wikimedia.org/T165519) (owner: 10Jcrespo) [16:14:02] volans: I know! 
[16:17:23] (03CR) 10Giuseppe Lavagetto: [C: 031] profile::puppetmaster::common: Always enable environments [puppet] - 10https://gerrit.wikimedia.org/r/393677 (owner: 10Andrew Bogott) [16:17:50] (03CR) 10Giuseppe Lavagetto: [C: 031] puppetmaster::standalone: include environment env [puppet] - 10https://gerrit.wikimedia.org/r/393678 (owner: 10Andrew Bogott) [16:17:52] PROBLEM - pdfrender on scb1001 is CRITICAL: HTTP CRITICAL - No data received from host [16:18:02] PROBLEM - eventstreams on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8092: Connection refused [16:18:52] PROBLEM - Host kafka1018.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:19:11] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.092 second response time [16:19:51] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [16:23:11] (03PS1) 10Filippo Giunchedi: role: split prometheus redis jobs [puppet] - 10https://gerrit.wikimedia.org/r/393794 (https://phabricator.wikimedia.org/T148637) [16:23:30] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1110 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393795 [16:23:36] (03CR) 10jerkins-bot: [V: 04-1] role: split prometheus redis jobs [puppet] - 10https://gerrit.wikimedia.org/r/393794 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [16:25:04] (03PS2) 10Filippo Giunchedi: role: split prometheus redis jobs [puppet] - 10https://gerrit.wikimedia.org/r/393794 (https://phabricator.wikimedia.org/T148637) [16:25:25] !log mw1329 boot to PXE (should come up with new .66 IP) [16:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:40] [it did] [16:34:42] RECOVERY - Host kafka1018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [16:36:02] (03PS1) 10Hashar: contint: restrict docker-pkg conf read perm [puppet] - 10https://gerrit.wikimedia.org/r/393798 [16:36:23] 10Operations, 10Services, 10Graphite, 10Wikimedia-Incident: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333#3793015 (10fgiunchedi) [16:36:25] 10Operations, 10Graphite, 10Patch-For-Review, 10Services (watching), 10Wikimedia-Incident: Alert on graphite UDP loss - https://phabricator.wikimedia.org/T181382#3793013 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi [16:36:27] 10Operations, 10ops-eqiad, 10Analytics-Kanban: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3792843 (10Cmjohnson) Yes, there is definitely a bad disk but the memory errors are probably a bad CPU or motherboad. It would extremely bad fortune for all 8 DIMM to fail at one time. But I di... [16:37:21] 10Operations, 10Services, 10Graphite, 10Wikimedia-Incident: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333#3786600 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Resolving, we haven't seen adverse effects from service-runner's latest version. [16:37:36] (03CR) 10Giuseppe Lavagetto: [C: 032] contint: restrict docker-pkg conf read perm [puppet] - 10https://gerrit.wikimedia.org/r/393798 (owner: 10Hashar) [16:38:22] PROBLEM - pdfrender on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 5252: Connection refused [16:38:42] elukey: kafka1018 motherboard issues [16:39:12] :( [16:42:01] <_joe_> elukey: there you go, did you have plans for tonight? :P [16:42:45] _joe_: kafka can take it! It is distributed and resilient :P [16:42:52] famous last words [16:42:54] elukey: it's webscale! 
[16:43:08] * volans hides [16:45:22] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [16:50:00] madhuvishy: yt? [16:50:34] she's in a meeting [16:51:11] all right so we have a couple of options [16:52:18] 1) Chris does some magic and the hardware goes back to a normal state (but it seems unlikely from what Chris said in the task) [16:53:01] 2) we repurpose one of the notebook100[12] host to kafka1018 and decom the old host (they have the same hw specs). We currently don't know if this is possible, so we'll ask more info to Madhu [16:53:02] elukey: no magic to be had unless I swap out the motherboard....which I could do but probably not today [16:53:17] (03PS1) 10Giuseppe Lavagetto: site.pp: remove import of realm.pp [puppet] - 10https://gerrit.wikimedia.org/r/393799 [16:53:44] 3) decom in kafka 1018, but it will cause topic partitions to move to other nodes and it wouldn't be ideal capacity wise [16:53:46] i would have to take from an old swift server but there are some differences [16:54:02] ah so a reimage would be enough? [16:54:22] not really urgent, we can wait a couple of days [16:54:36] in the meantime we prepare for 2) or 3) [16:55:42] 10Operations, 10Discovery, 10Traffic, 10Wikimedia-Apache-configuration, and 4 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#3793143 (10Fjalapeno) [17:00:05] godog, moritzm, and _joe_: Dear deployers, time to do the Puppet SWAT(Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171128T1700). [17:00:05] dereckson: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:56] ottomata: hey! [17:01:09] dereckson: I'll merge your patch [17:01:27] (03PS4) 10Filippo Giunchedi: Clarify header documentation for Apache redirects [puppet] - 10https://gerrit.wikimedia.org/r/285973 (owner: 10Dereckson) [17:02:08] (03CR) 10Filippo Giunchedi: [C: 032] Clarify header documentation for Apache redirects [puppet] - 10https://gerrit.wikimedia.org/r/285973 (owner: 10Dereckson) [17:06:42] (03PS1) 10Awight: Remove beta cluster customizations for ORES [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393801 (https://phabricator.wikimedia.org/T181187) [17:08:48] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006 - https://phabricator.wikimedia.org/T181121#3793200 (10Cmjohnson) Both servers are in h/w test with the Dell diagnosti tool at this moment. The Dell racadm system log did not show any errors of any type. [17:08:51] madhuvishy: o/ [17:08:55] ^ :) [17:09:15] I'd need to bother you a second about notebooks :) [17:09:46] kafka1018 died and it is OOW, so we are exploring the possibility of repurposing one of the notebooks to a kafka role [17:10:08] (the notebooks be replaced soonish with new hw) [17:11:42] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4031 is CRITICAL: connect to address 10.128.0.131 and port 3128: Connection refused [17:11:58] ah, no one actively uses notebook1002 afaict, and since I don't actively work on it, I haven't split the usecases (mysql and mysql+hadoop) between the two servers. I mostly use notebook1002 as the canary to test package upgrades and stuff before applying to 1001 [17:12:13] elukey: when is the new hardware replacement stuff happening? 
[17:12:42] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4031 is OK: HTTP OK: HTTP/1.1 200 OK - 177 bytes in 0.157 second response time [17:12:46] (03PS4) 10Andrew Bogott: profile::puppetmaster::common: Always enable environments [puppet] - 10https://gerrit.wikimedia.org/r/393677 [17:14:21] (03CR) 10Andrew Bogott: [C: 032] profile::puppetmaster::common: Always enable environments [puppet] - 10https://gerrit.wikimedia.org/r/393677 (owner: 10Andrew Bogott) [17:14:59] (03PS4) 10Rush: apt: add class apt::dpkgconfold and include it from apt::unattendedupgrades [puppet] - 10https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811) (owner: 10Arturo Borrero Gonzalez) [17:15:12] madhuvishy: we are in the process of getting specs and ordering, so we don't have a clear date yet.. but since we are a big ignorant about the internals of the notebooks we were wondering if one could be re-used or not [17:15:22] ah sorry reading the rest [17:15:38] (03PS1) 10Alexandros Kosiaris: Update to 1.7.11 [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/393805 [17:15:50] akosiaris: \o/ [17:16:03] :-) [17:16:11] let's see how this compiles [17:16:21] RECOVERY - Host mw1276 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [17:16:35] madhuvishy: ah so theoretically notebook1002 could be temporarily used for kafka? Until the new hw gets in the dc [17:17:10] 10Operations, 10ops-eqiad: Lost network connectivity on mw1276 - https://phabricator.wikimedia.org/T181397#3793255 (10Cmjohnson) 05Open>03Resolved At first glance, the network led was dark. I rebooted the server, drained the flea power and the NIC came back to life. [17:17:43] elukey: yeah, go for it I think - are you planning to wipe it out and set up kafka, or set it up along side? [17:17:48] (03PS2) 10Alexandros Kosiaris: Update to 1.7.11 [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/393805 [17:18:37] (03PS3) 10Alexandros Kosiaris: Update to 1.7.11 [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/393805 (https://phabricator.wikimedia.org/T181489) [17:18:56] madhuvishy: complete reimage, but only if we can't find a used motherboard to swap on kafka1018 [17:19:20] elukey: hmmm okay [17:20:13] elukey: could you also send a note to analytics-l saying the backup notebook server is being reused for this when you do it? [17:22:48] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1051 - https://phabricator.wikimedia.org/T181345#3793300 (10Cmjohnson) Replaced disk 3 and it's rebuilding Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Rebuild Firmware state: Online, Spun... [17:23:39] madhuvishy: sure! I'll also ping you beforehand.. sorry for this trouble but we are not ready to switch clients to the new kafka jumbo cluster [17:23:54] 10Operations, 10netops: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#3793312 (10Cmjohnson) a:03RobH This did not get setup before shipping to Singapore...assigning to @RobH [17:23:58] I'm in a meeting, trying to catch up but... what kind of hw is kafka1018? [17:24:05] can't we just swap the disks with another system, like a spare? [17:24:07] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1051 - https://phabricator.wikimedia.org/T181345#3793321 (10Marostegui) Thanks! [17:24:31] we may not be able to swap motherboards, but could be able to do it the other way around? 
[17:25:07] (swap the disks on an existing system) [17:25:08] 10Operations, 10ops-eqiad, 10Discovery, 10Maps, 10Maps-Sprint: Power supply issue on maps1002 - https://phabricator.wikimedia.org/T181477#3791578 (10Cmjohnson) 05Open>03Resolved Loose power cable. Fixed [17:25:33] paravoid: so notebook1002 has the same hw as kafka1018, this is why we were thinking about repurposing it [17:25:48] and Madhu confirmed that it is not heavily used so far [17:26:08] ok, I'm not objecting, just asking whether we can accomodate you in a way that makes this easier :) [17:26:11] (03PS3) 10Andrew Bogott: puppetmaster::standalone: include environment env [puppet] - 10https://gerrit.wikimedia.org/r/393678 [17:26:25] we can assign one of the on-site spares, temporarily or permanently [17:26:54] paravoid: sure! Didn't know about this possibility, that would be good too if the hw specs are not too distant [17:26:56] (03CR) 10Andrew Bogott: [C: 032] puppetmaster::standalone: include environment env [puppet] - 10https://gerrit.wikimedia.org/r/393678 (owner: 10Andrew Bogott) [17:27:21] robh: ^ :) [17:27:47] (apologies for the brevity, still in a meeting) [17:27:54] (03CR) 10Awight: [C: 032] Remove beta cluster customizations for ORES [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393801 (https://phabricator.wikimedia.org/T181187) (owner: 10Awight) [17:27:58] going to update the task with all the options [17:28:23] cool [17:28:34] <3 [17:31:42] 10Operations, 10ops-eqiad, 10Analytics-Kanban: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3793359 (10elukey) Summary of what has been discussed so far on IRC: - Chris will try to find a used motherboard in the DC and see if it can be swapped on kafka1018, so a simple reimage should... [17:31:46] there you go :) [17:31:53] also pinged rob in there [17:33:22] RECOVERY - IPMI Sensor Status on maps1002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [17:33:29] (03Merged) 10jenkins-bot: Remove beta cluster customizations for ORES [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393801 (https://phabricator.wikimedia.org/T181187) (owner: 10Awight) [17:33:43] (03CR) 10jenkins-bot: Remove beta cluster customizations for ORES [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393801 (https://phabricator.wikimedia.org/T181187) (owner: 10Awight) [17:35:26] (03CR) 10Deskana: [C: 04-1] "See my comment on Phabricator." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045) (owner: 10TerraCodes) [17:35:31] (03PS1) 10Filippo Giunchedi: hieradata: disable restbase1007-c cassandra instance [puppet] - 10https://gerrit.wikimedia.org/r/393807 (https://phabricator.wikimedia.org/T179422) [17:35:49] urandom: ^ [17:36:26] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/393807 (https://phabricator.wikimedia.org/T179422) (owner: 10Filippo Giunchedi) [17:36:44] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: disable restbase1007-c cassandra instance [puppet] - 10https://gerrit.wikimedia.org/r/393807 (https://phabricator.wikimedia.org/T179422) (owner: 10Filippo Giunchedi) [17:36:51] (03PS2) 10Filippo Giunchedi: hieradata: disable restbase1007-c cassandra instance [puppet] - 10https://gerrit.wikimedia.org/r/393807 (https://phabricator.wikimedia.org/T179422) [17:36:56] (03PS1) 10Jcrespo: mariadb: Setup s8 empty on eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393808 (https://phabricator.wikimedia.org/T177208) [17:38:12] (03PS2) 10Jcrespo: mariadb: Setup s8 empty on eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393808 (https://phabricator.wikimedia.org/T177208) [17:38:16] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Setup s8 empty on eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393808 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [17:39:21] RECOVERY - Disk space on restbase1007 is OK: DISK OK [17:39:31] RECOVERY - DPKG on restbase1007 is OK: All packages OK [17:39:32] RECOVERY - dhclient process on restbase1007 is OK: PROCS OK: 0 processes with command name dhclient [17:39:32] RECOVERY - Check size of conntrack table on restbase1007 is OK: OK: nf_conntrack is 1 % full [17:39:32] RECOVERY - MD RAID on restbase1007 is OK: OK: Active: 15, Working: 15, Failed: 0, Spare: 0 [17:39:32] RECOVERY - configured eth on restbase1007 is OK: OK - interfaces up [17:39:32] RECOVERY - cassandra-b service on restbase1007 is OK: OK - cassandra-b is active [17:39:32] RECOVERY - cassandra-a service on restbase1007 is OK: OK - cassandra-a is active [17:39:51] RECOVERY - Check whether ferm is active by checking the default input chain on restbase1007 is OK: OK ferm input default policy is set [17:39:51] RECOVERY - Check systemd state on restbase1007 is OK: OK - running: The system is fully operational [17:39:58] (03PS3) 10Jcrespo: mariadb: Setup s8 empty on eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393808 (https://phabricator.wikimedia.org/T177208) [17:40:02] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:42:08] !log restart cassandra, restbase1007, to pickup logstash java deps - T179422 [17:42:08] (03CR) 10Jcrespo: "Needs thorough check, let's not rush the deployment, even if it is a noop." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/393808 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [17:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:15] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [17:42:59] (03PS1) 10Filippo Giunchedi: Revert "hieradata: disable restbase1007-c cassandra instance" [puppet] - 10https://gerrit.wikimedia.org/r/393810 (https://phabricator.wikimedia.org/T179422) [17:43:47] (03CR) 10Filippo Giunchedi: [C: 04-1] "DNM yet" [puppet] - 10https://gerrit.wikimedia.org/r/393810 (https://phabricator.wikimedia.org/T179422) (owner: 10Filippo Giunchedi) [17:43:51] RECOVERY - Check the NTP synchronisation status of timesyncd on restbase1007 is OK: OK: synced at Tue 2017-11-28 17:43:43 UTC. [17:44:43] (03CR) 10Rush: "@andrew, this will catch and upgrade puppet clients. is that an issue?" [puppet] - 10https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811) (owner: 10Arturo Borrero Gonzalez) [17:45:06] (03CR) 10Rush: "puppet-common upgrade is from 3.4.3-1ubuntu1.2 --to--> 3.8.5-2~bpo8trusty+2" [puppet] - 10https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811) (owner: 10Arturo Borrero Gonzalez) [17:45:12] (03PS5) 10Rush: apt: add class apt::dpkgconfold and include it from apt::unattendedupgrades [puppet] - 10https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811) (owner: 10Arturo Borrero Gonzalez) [17:46:16] !log decommissioning cassandra, restbase1007-b - T179422 [17:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:51] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received [17:49:31] PROBLEM - mediawiki-installation DSH group on mw1276 is CRITICAL: Host mw1276 is not in mediawiki-installation dsh group [17:49:51] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received [17:50:41] (03PS1) 10Muehlenhoff: Create prometheus user and switch systemd unit to it [debs/prometheus-openldap-exporter] - 10https://gerrit.wikimedia.org/r/393811 [17:52:05] We’re losing ORES servers, looks like we’ll have to do some manual restarting… [17:52:07] akosiaris: ^ [17:52:31] PROBLEM - ores on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:52:41] PROBLEM - SSH on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:52:41] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:52:42] PROBLEM - apertium apy on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:53:01] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [17:53:10] codfw is doing great. [17:53:14] We could fail over to it. 
[17:53:18] It has more capacity too [17:53:21] RECOVERY - ores on scb1001 is OK: HTTP OK: HTTP/1.0 200 OK - 3580 bytes in 0.026 second response time [17:53:31] RECOVERY - SSH on scb1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [17:53:32] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [17:53:32] RECOVERY - apertium apy on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5892 bytes in 0.003 second response time [17:53:51] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [17:54:03] akosiaris: ^ see halfak’s failover suggestion [17:54:15] (03CR) 10Smalyshev: Enable configuration for aliasing namespaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/392554 (https://phabricator.wikimedia.org/T181016) (owner: 10Smalyshev) [17:54:35] it's only 1 box that is having problems though [17:54:47] two [17:54:50] scb1001 and scb1002 [17:55:42] https://grafana.wikimedia.org/dashboard/db/ores?panelId=3&fullscreen&orgId=1&from=now-3h&to=now-1m [17:55:46] akosiaris, ^ [17:55:59] You can scb1001 and 1002 periodically not serving requests [17:56:09] But 1003/1004 are fine [17:56:24] I 'll try and lower the weight first [17:56:35] send more traffic to scb1003, scb1004 [17:57:41] they are already handling 50% more traffic just fine [17:58:22] akosiaris, OK fair point. [17:58:36] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: scb1001.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores']) [17:58:40] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: scb1002.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores']) [17:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:51] ok, that part done [17:59:04] (03PS9) 10Smalyshev: Enable configuration for aliasing namespaces [puppet] - 10https://gerrit.wikimedia.org/r/392554 (https://phabricator.wikimedia.org/T181016) [17:59:11] now we got the issue I can't do the same for the workers [17:59:27] but I can already see load dropping [17:59:44] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&var-instance=All [18:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171128T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
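The weight changes logged at 17:58 above are plain conftool actions: the ores weight on scb1001 and scb1002 is dropped to 5 so LVS steers most requests to the beefier scb1003/scb1004. A rough sketch of scripting the same change is below; the confctl select syntax is inferred from those SAL lines and should be checked against the installed conftool version before use.

```python
#!/usr/bin/env python3
"""Lower (or restore) the pooled weight of a service on a set of hosts.

Sketch only: the `confctl select ... set/weight=N` invocation is inferred
from the SAL entries above, and it is assumed to be run on a puppetmaster
where conftool has etcd credentials.
"""
import subprocess

def set_weight(host, weight, dc='eqiad', cluster='scb', service='ores'):
    selector = 'dc=%s,cluster=%s,service=%s,name=%s' % (dc, cluster, service, host)
    cmd = ['confctl', 'select', selector, 'set/weight=%d' % weight]
    print('running:', ' '.join(cmd))
    subprocess.check_call(cmd)

if __name__ == '__main__':
    for host in ('scb1001.eqiad.wmnet', 'scb1002.eqiad.wmnet'):
        set_weight(host, 5)   # later restored to 10, as in the log
```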
[18:00:09] !log force stop celery-ores-worker on scb1001 [18:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:16] (03PS5) 10Elukey: Rename ::profile::hadoop::client to commons and move some features out [puppet] - 10https://gerrit.wikimedia.org/r/393738 (https://phabricator.wikimedia.org/T167790) [18:00:17] let's see if the others pick up the load [18:00:18] (03PS6) 10Elukey: profile::hadoop::common,profile::hive::client: move hiera config in one place [puppet] - 10https://gerrit.wikimedia.org/r/393741 (https://phabricator.wikimedia.org/T167790) [18:00:20] (03PS7) 10Elukey: profile::hadoop::common: import hiera config from cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/393756 (https://phabricator.wikimedia.org/T167790) [18:00:38] (03CR) 10Smalyshev: Enable configuration for aliasing namespaces (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/392554 (https://phabricator.wikimedia.org/T181016) (owner: 10Smalyshev) [18:01:30] 10Operations, 10Scoring-platform-team: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793518 (10awight) [18:02:52] (03PS1) 10Smalyshev: Enable wdqs-admins to restart nginx [puppet] - 10https://gerrit.wikimedia.org/r/393814 [18:02:54] 10Operations, 10Ops-Access-Requests: Requesting access to terbium/wasat for Trey Jones - https://phabricator.wikimedia.org/T181479#3793533 (10EBernhardson) [18:03:03] memory is really low on scb1002 [18:03:07] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review, 10Services (watching): Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051#3793534 (10debt) p:05Triage>03High [18:03:12] 10Operations, 10Ops-Access-Requests, 10Gerrit: Access to logstash (LDAP group 'nda') for Paladox - https://phabricator.wikimedia.org/T181446#3793536 (10demon) >>! In T181446#3791417, @Legoktm wrote: > Do we have alerts for Gerrit exceptions? If we could set up IRC alerts (or something else), paladox could fo... [18:03:18] 10Operations, 10Scoring-platform-team: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793540 (10awight) [18:03:21] Doesn't look like out memory usage changed :/ [18:03:22] 10Operations, 10Ops-Access-Requests: Requesting access to terbium/wasat for Trey Jones - https://phabricator.wikimedia.org/T181479#3791630 (10EBernhardson) @EBjune this will require your approval [18:03:27] *our --> ORES [18:03:45] so, the other hosts are picking up the load [18:03:54] rather okish I 'd saty [18:03:56] say* [18:04:04] now on scb1001 OOM showed up [18:04:11] halfak: Where are you seeing OOM? https://grafana.wikimedia.org/dashboard/db/ores?panelId=6&fullscreen&orgId=1&from=1511881438154&to=1511892178154 [18:04:15] killing multiple things, including nodejs services [18:04:24] Not seeing OOM just checking available memory [18:04:30] with top :) [18:04:40] that’s a serious problem if our monitoring doesn’t see that! [18:04:54] 10Operations, 10Discovery, 10Traffic, 10Wikimedia-Apache-configuration, and 4 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#3793547 (10JoeWalsh) Confirmed that we can remove the `*.m.wikipedia.org` from the config and the OS will stop re... [18:04:59] halfak: KiB Mem: 32868828 total, 18434972 used, 14433856 free, 20892 buffers [18:05:02] 'sup [18:05:03] akosiaris, confirmed that overload errors are dropping [18:05:21] awight, what machine? 
[18:05:26] scb1001 [18:05:29] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:05:31] I'm on scb1002 [18:05:33] and scb1002 [18:05:41] both had OOM show up [18:05:48] KiB Mem: 32868828 total, 32348372 used, 520456 free, 20912 buffers [18:06:02] I see a vastly different # of free KiB than you [18:06:14] halfak: `top` ?? [18:06:15] KiB Mem: 32868828 total, 18546496 used, 14322332 free, 10564 buffers [18:06:18] on scb1002 [18:06:27] Ahh yeah. Just changed [18:06:34] also scb1003 and scb1004 had OOM show up [18:06:50] but way less, they are handling it better [18:06:58] no parsoid deploy today [18:07:09] makes sense, they are better boxes [18:07:59] actually this has been happening since yesterday [18:08:08] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&var-instance=All&from=now-12h&to=now [18:08:11] Someone kill all the celery on scb1002? [18:08:24] look at how scb1001, scb1002 have weird graphs on memory [18:08:37] halfak: I did. I stopped it on purpose [18:08:44] logged it at SAL as well [18:08:49] kk [18:08:52] RECOVERY - IPMI Sensor Status on restbase1007 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [18:08:56] (03CR) 10Andrew Bogott: [C: 031] apt: add class apt::dpkgconfold and include it from apt::unattendedupgrades [puppet] - 10https://gerrit.wikimedia.org/r/392421 (https://phabricator.wikimedia.org/T180811) (owner: 10Arturo Borrero Gonzalez) [18:09:48] Doesn't look like it had to do with ORES deployment [18:10:08] there is also https://phabricator.wikimedia.org/T181346 that might had some effect [18:10:10] Pattern seems to start 11/23 [18:11:00] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikidata, 10Wikidata-Query-Service: Enable wdqs-admin's to control nginx - https://phabricator.wikimedia.org/T181540#3793565 (10Smalyshev) [18:11:05] scratch the scb1003, scb1004 OOM comment [18:11:15] it was an expected cgroup oom [18:11:16] no OOM there? [18:11:17] trending edits not getting its last committed offset and then pulling everything from kafka (from the beginning of the topic) [18:11:25] (03PS2) 10Smalyshev: Enable wdqs-admins to restart nginx [puppet] - 10https://gerrit.wikimedia.org/r/393814 (https://phabricator.wikimedia.org/T181540) [18:11:32] halfak: there is but it's unrelated and it's actually a safeguard [18:11:38] kk [18:11:47] limiting the memory of electron pdf render so it doesn't cause issues [18:11:56] so the other 2 boxes are handling everything fine yet [18:12:21] elukey: Interesting guess—I was noticing that the “cached” memory usage is spiking along with total usage, that might be consistent with the Kafka glitch theory? [18:12:59] halfak: it seems like we are holding as we are currently [18:12:59] We are getting a *terrible* cache hit rate recently. [18:13:01] * halfak cecks that [18:13:11] we are ? [18:13:31] awight: just mentioned the task since it is a unsolved (afaik) issue on scb100[12] nodes [18:13:47] but from the logs I didn't notice any correlation (trending edits) [18:13:54] looks unrelated. [18:14:00] ah yes, since 15:43 UTC today [18:14:06] right. [18:14:11] halfak: just mentioned, that's it :) [18:14:14] We got a big boost in traffic. [18:14:17] thanks elenah [18:14:20] *eluke [18:14:20] * elukey afk [18:14:23] arg [18:14:26] :D [18:15:01] halfak: and an increase in external scores as well ? 
[18:15:12] judging from https://grafana.wikimedia.org/dashboard/db/ores?panelId=1&fullscreen&orgId=1&from=now-3h&to=now-1m [18:15:29] so... extra req/s at the service ? [18:15:51] akosiaris, right. Outside of precaching/kafka/changeprop. [18:16:12] Usually we don't see a spike last this long. [18:16:19] well https://grafana.wikimedia.org/dashboard/db/ores?panelId=10&fullscreen&orgId=1&from=now-3h&to=now-1m doesn't show a spike [18:16:45] akosiaris, right. Sort of a drop in the bucket. [18:16:48] But correlated. [18:17:08] !log awight@tin Started deploy [ores/deploy@e58bfbf]: (non-production) Update ORES on new cluster [18:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:29] awight: please tell me you are not updating ORES on scb boxes [18:17:43] halfak: I am not seeing the correlation yet on those graphs [18:17:51] akosiaris: :D no, only non-production ores* boxes [18:18:03] https://grafana.wikimedia.org/dashboard/db/ores?panelId=1&fullscreen&orgId=1&from=now-3h&to=now-1m [18:18:05] akosiaris, ^ [18:18:33] FWIW, we can usually handle around 3 times this request rate without issues. [18:18:34] so yeah I see that. Starting on 16:00 UTC todaty [18:18:35] akosiaris: Looks like that’s not going to work though, maybe we can diagnose after the current situation is resolved. > Timeout, server ores1003.eqiad.wmnet not responding. etc. [18:18:49] !log awight@tin Finished deploy [ores/deploy@e58bfbf]: (non-production) Update ORES on new cluster (duration: 01m 41s) [18:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:07] halfak: scratch that, I see it [18:19:14] https://grafana.wikimedia.org/dashboard/db/ores?panelId=11&fullscreen&orgId=1&from=now-3h&to=now-1m [18:19:20] my bad eyes, sorry I should have looked at the scale more [18:19:33] ok we have more external users [18:19:39] (03PS1) 10Chad: scap prep: Don't copy the patch dir if it already exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393815 [18:19:43] Right and soon after the timeouts started. [18:19:53] But only on scb1001 and scb1002 [18:19:56] yeah so someone is hitting us real hard [18:20:11] RECOVERY - Long running screen/tmux on restbase1007 is OK: OK: SCREEN detected but not long running. [18:20:18] and managed to trip the 2 lower power boxes [18:20:26] the other 2 are just fine [18:20:42] I am tempted to say we handle the load fine with those 2 currently [18:21:03] but we can of course fail over to codfw if we are afraid this is gonna get worse [18:21:23] is my assumption correct ? I don't see any issue on scb1003, scb1004 currently [18:21:31] (03PS1) 10Chad: group0 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393817 [18:21:36] are we managing fewer scores than we should ? [18:21:44] which I guess warrants a failover to codfw ? [18:21:47] halfak: ^ [18:22:15] akosiaris, I'd like to fail over to codfw. What are the drawbacks? [18:22:18] !log demon@tin Started scap: bootstrap wmf.10 [18:22:20] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3793642 (10Cmjohnson) p:05High>03Low [18:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:25] codfw should be able to handle *way* more capacity. [18:22:56] well we do have those bottlenecks with the number of concurrent ORES workers across the entire fleet [18:23:08] it was a variable right ? 
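The disagreement above about free memory (one top reading showing about 14 GB free, another about 0.5 GB, while the dashboard never showed a dip) is usually the MemFree versus MemAvailable distinction: page cache eats into MemFree while the host still has plenty of reclaimable memory. The sketch below prints both fields from /proc/meminfo; which of them the ORES Grafana panel actually graphs is an assumption to verify against the node exporter metric it uses.

```python
#!/usr/bin/env python3
"""Compare strictly-free memory with reclaimable memory on a Linux host.

Helps explain why `top` and a dashboard can disagree: MemFree drops as the
page cache grows, while MemAvailable stays high until anonymous memory is
really exhausted. Which field the ORES dashboard charts is not known here.
"""

def meminfo():
    info = {}
    with open('/proc/meminfo') as fh:
        for line in fh:
            key, value = line.split(':', 1)
            info[key] = int(value.strip().split()[0])   # values are in kB
    return info

if __name__ == '__main__':
    mi = meminfo()
    gib = 1024 * 1024
    print('MemTotal:     %5.1f GiB' % (mi['MemTotal'] / gib))
    print('MemFree:      %5.1f GiB  (roughly what a quick glance at top shows)' % (mi['MemFree'] / gib))
    print('MemAvailable: %5.1f GiB  (free plus easily reclaimable cache)' % (mi.get('MemAvailable', 0) / gib))
```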
[18:23:21] I remember setting it to a different value with awight for the new cluster [18:23:26] akosiaris: Yes it’s tunable by config—but it’s the total number per machine, not across the entire cluster. [18:23:37] aah ok I forgot about that [18:24:00] so, do you feel we have problems right now we the current capacity ? [18:24:19] on my part is looks like maybe we could [18:24:41] I do see some overload errors but it's mostly scb1001, scb1002 [18:24:49] akosiaris: ores::web::celery_workers [18:24:53] and given the OOM killer on those I guess that makes sense [18:24:57] (03PS2) 10Chad: scap prep: Don't copy the patch dir if it already exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393815 [18:25:18] but currently scb1003, scb1004 are at 0 [18:25:36] (03PS1) 10Madhuvishy: Revert "public_dumps: Add puppet class to set up NFS for dumps servers" [puppet] - 10https://gerrit.wikimedia.org/r/393818 [18:25:44] I am worried about scb1003/1004 but it seems like we are handling a relatively high load with just scb1003/4. [18:25:50] (03PS2) 10Madhuvishy: Revert "public_dumps: Add puppet class to set up NFS for dumps servers" [puppet] - 10https://gerrit.wikimedia.org/r/393818 [18:25:55] We'd have a much higher ceiling with codfw. [18:26:08] It seems that codfw is intended for this type of situation [18:26:23] 10Operations, 10Scoring-platform-team: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793657 (10awight) We saw very different available memory levels using `top` directly on scb100[1-2], vs. the ORES Grafana dashboard which never showed a dip below c. 20GB. This needs to be fi... [18:26:41] halfak: that's actually not something we have defined very clearly [18:26:46] (03CR) 10Madhuvishy: [C: 032] Revert "public_dumps: Add puppet class to set up NFS for dumps servers" [puppet] - 10https://gerrit.wikimedia.org/r/393818 (owner: 10Madhuvishy) [18:26:51] but yeah I can see your point [18:26:58] ok let's be prudent and failover [18:27:11] we might have way more incoming reqs [18:27:25] Right. Thanks. [18:28:06] Just a note of caution—Fundraising started their biggest campaign today, so if there’s any hint of instability on the wikis themselves, let’s turn ORES off for enwiki. [18:28:37] * akosiaris preparing patch [18:28:39] awight, that would cause massive instability for enwiki in the form of vandalism [18:28:41] :/ [18:28:46] !log (re)bootstrapping cassandra, restbase1007-b - T179422 [18:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:54] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [18:29:05] halfak: True. Hopefully we don’t have to do that triage. 
[18:29:15] I don't expect we' [18:29:21] 'll be taking down S:RC again [18:29:34] * awight maintains the thousand-yard stare [18:29:41] (03PS1) 10Andrew Bogott: Horizon puppettab: read puppet class info from the 'future' environment [puppet] - 10https://gerrit.wikimedia.org/r/393819 [18:30:33] !log awight@tin Started deploy [ores/deploy@e58bfbf]: (non-production) Update ORES on new cluster (take 2) [18:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:01] (03Abandoned) 10Dzahn: grafana: add dashboards for ganeti [puppet] - 10https://gerrit.wikimedia.org/r/393696 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:31:08] (03PS1) 10Alexandros Kosiaris: Failover ORES to codfw [puppet] - 10https://gerrit.wikimedia.org/r/393820 (https://phabricator.wikimedia.org/T181538) [18:31:26] (03PS1) 10Madhuvishy: Add ldap client nsswitch config file to labstore1006|7 hiera [puppet] - 10https://gerrit.wikimedia.org/r/393821 (https://phabricator.wikimedia.org/T181431) [18:31:40] (03CR) 10Alexandros Kosiaris: [C: 032] Failover ORES to codfw [puppet] - 10https://gerrit.wikimedia.org/r/393820 (https://phabricator.wikimedia.org/T181538) (owner: 10Alexandros Kosiaris) [18:31:57] (03PS2) 10Madhuvishy: Add ldap client nsswitch config file to labstore1006|7 hiera [puppet] - 10https://gerrit.wikimedia.org/r/393821 (https://phabricator.wikimedia.org/T181431) [18:32:41] (03CR) 10Madhuvishy: [C: 032] Add ldap client nsswitch config file to labstore1006|7 hiera [puppet] - 10https://gerrit.wikimedia.org/r/393821 (https://phabricator.wikimedia.org/T181431) (owner: 10Madhuvishy) [18:32:50] !log force puppet run on cache::misc boxes T181538 [18:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:00] T181538: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538 [18:33:53] ok traffic from all varnishes should start flowing to codfw [18:34:14] That was an awesome one-character change, btw. [18:34:25] awight: thank bblack :-) [18:34:28] Hey, varnish: >_> [18:34:56] 10Operations, 10Scoring-platform-team, 10Patch-For-Review: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793518 (10Halfak) Strange memory behavior on scb1001 and scb1002 for the last week: https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-da... [18:35:03] !log awight@tin Finished deploy [ores/deploy@e58bfbf]: (non-production) Update ORES on new cluster (take 2) (duration: 04m 30s) [18:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:29] 10Operations, 10Scoring-platform-team, 10Patch-For-Review: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793712 (10Halfak) scb1001 and scb1002 had OOMs show up. ``` [12:04:04] now on scb1001 OOM showed up ... 
[12:05:32] and scb1002 [12:05:41] (03PS2) 10Andrew Bogott: Horizon puppettab: read puppet class info from the 'future' environment [puppet] - 10https://gerrit.wikimedia.org/r/393819 [18:35:54] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/9030/" [puppet] - 10https://gerrit.wikimedia.org/r/392997 (owner: 10Dzahn) [18:36:01] (03PS2) 10Dzahn: url_downloader: move firewall to role, use profile [puppet] - 10https://gerrit.wikimedia.org/r/392997 [18:37:14] I can see cpu usage increasing on codfw [18:37:17] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/392554 (https://phabricator.wikimedia.org/T181016) (owner: 10Smalyshev) [18:37:55] I am guessing it will take a while for ORES cache to be fully warmed up [18:38:17] 10Operations, 10Scoring-platform-team, 10Patch-For-Review: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793728 (10Halfak) We had a sudden increase in requests/min for ORES around 1600 UTC. But we've seen bigger spikes that did now cause timeouts or memory issues around 130... [18:38:19] akosiaris, shouldn't take too long. [18:38:21] it's https://grafana.wikimedia.org/dashboard/db/ores?panelId=12&fullscreen&orgId=1&from=now-5m&to=now-1m [18:38:26] (03CR) 10Gehel: [C: 04-1] "LGTM, but I think this needs to be validated in our Ops meeting. I'll make sure to bring that up." [puppet] - 10https://gerrit.wikimedia.org/r/393814 (https://phabricator.wikimedia.org/T181540) (owner: 10Smalyshev) [18:38:37] halfak: yup it shouldn't indeed [18:38:40] ccccccevibtjurbdufihuenhgftlbgcvvkickbhjfkhf [18:38:46] lol [18:38:47] no worries, unused yubikey [18:38:58] (03CR) 10Andrew Bogott: [C: 032] Horizon puppettab: read puppet class info from the 'future' environment [puppet] - 10https://gerrit.wikimedia.org/r/393819 (owner: 10Andrew Bogott) [18:39:00] I should pull it out of my laptop [18:39:02] (03PS3) 10Andrew Bogott: Horizon puppettab: read puppet class info from the 'future' environment [puppet] - 10https://gerrit.wikimedia.org/r/393819 [18:39:52] !log akosiaris@puppetmaster1001 conftool action : set/weight=10; selector: scb1001.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores']) [18:39:54] 10Operations, 10Scoring-platform-team, 10Patch-For-Review: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793732 (10Halfak) We've failed over to CODFW. [18:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:57] !log akosiaris@puppetmaster1001 conftool action : set/weight=10; selector: scb1002.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores']) [18:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:14] !log revert weight changes for scb1001, scb1002 T181835 [18:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:14] akosiaris: just to make sure (I have ignored scrollback, was in a meeting), this isn't something to block the train, right? [18:41:32] greg-g: no it doesn't have to, not anymore [18:41:35] I mean, in a bit, chad's still prepping it [18:42:06] akosiaris: kk, thank you. is it an incident report worthy thing? :) [18:42:09] * no_justification really really wants a normal train week [18:42:14] greg-g: yes it is [18:42:18] * greg-g nods [18:42:25] halfak is already collecting notes in T181835 [18:42:29] thanks! 
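The conftool weight changes logged above adjust how much of the ORES traffic pybal sends to each scb backend. One quick way to confirm the resulting pool state is to fetch the pool file that pybal consumes, e.g. the eqiad `ores` pool on config-master (the same URL is checked later in this log). A minimal sketch, assuming Python 3 and that the file is plain text with one backend entry (host, weight, pooled state) per line:

```
# Sketch: print the pybal pool file for the ORES service in eqiad.
# Availability and exact format of the file are assumptions.
from urllib.request import urlopen

POOL_URL = "https://config-master.wikimedia.org/pybal/eqiad/ores"

with urlopen(POOL_URL) as resp:
    for line in resp.read().decode("utf-8").splitlines():
        # each line is expected to describe one backend: host, weight, pooled state
        print(line)
```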
[18:42:32] 10Operations, 10Scoring-platform-team, 10monitoring: Investigate scb1001 and scb1002 available memory graphs in Grafana - https://phabricator.wikimedia.org/T181544#3793736 (10awight) [18:42:32] I 'll add mine later on [18:43:42] RECOVERY - puppet last run on labstore1006 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [18:44:26] (03CR) 10Dzahn: "no-op on aluminium, actinium" [puppet] - 10https://gerrit.wikimedia.org/r/392997 (owner: 10Dzahn) [18:44:31] T181835 is a 404 [18:45:01] T181538 [18:45:03] T181538: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538 [18:45:06] dyslexia is hard [18:45:21] lol ty for digging that up [18:45:32] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793766 (10greg) [18:46:04] :) [18:46:39] 10Operations, 10Discovery, 10Traffic, 10Wikimedia-Apache-configuration, and 4 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#3793768 (10Fjalapeno) @Mholloway spoke with @JoeWalsh about the timing and targeting first week of January seems... [18:49:41] 10Operations, 10Scoring-platform-team: Let the ORES application set log severity, not uWSGI - https://phabricator.wikimedia.org/T181546#3793772 (10awight) [18:50:10] (03PS1) 10Legoktm: Set $wgRestrictionMethod = 'firejail'; everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393825 (https://phabricator.wikimedia.org/T173370) [18:50:11] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793785 (10Halfak) From https://grafana.wikimedia.org/dashboard/db/ores?panelId=14&fullscreen&orgId=1&from=1511872559429&to=1511894099429, we can s... [18:50:12] sorry about that :-( [18:50:15] (03CR) 10Chad: [V: 032 C: 032] Gerrit: Add wmf branding to PolyGerrit [software/gerrit] - 10https://gerrit.wikimedia.org/r/393753 (owner: 10Paladox) [18:51:01] !log demon@tin Started deploy [gerrit/gerrit@571cf4c]: deploying 2.15+ polygerrit style changes [18:51:06] (03PS3) 10Dzahn: ganeti: create profiles, split monitoring/firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/392564 [18:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:10] !log demon@tin Finished deploy [gerrit/gerrit@571cf4c]: deploying 2.15+ polygerrit style changes (duration: 00m 09s) [18:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:20] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793822 (10awight) Looks like we've been hitting memory limits for quite a while, at least since Oct 26th: https://logstash.wikimedia.org/goto/4e64... 
[18:54:32] RECOVERY - puppet last run on labstore1007 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [18:57:39] !log demon@tin Finished scap: bootstrap wmf.10 (duration: 35m 20s) [18:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:30] (03PS1) 10Chad: Add closed wikis to group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393827 [19:04:39] (03PS2) 10Chad: Add closed wikis to group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393827 [19:04:41] RECOVERY - MegaRAID on db1051 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [19:05:04] 10Operations, 10ORES, 10Scoring-platform-team (Current), 10User-Joe, 10User-Ladsgroup: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3793896 (10awight) [19:05:21] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team (Current), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3793896 (10awight) [19:06:59] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team (Current), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3793896 (10awight) [19:09:51] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team (Current), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3793924 (10awight) This could be another submodule rewriting problem. I see that .gitmodules has been modified on the ta... [19:12:27] 10Operations, 10ops-eqsin: rack/setup/install bast5001 - https://phabricator.wikimedia.org/T181554#3793949 (10RobH) [19:12:37] 10Operations, 10ops-eqsin: rack/setup/install bast5001 - https://phabricator.wikimedia.org/T181554#3793964 (10RobH) [19:14:33] 10Operations, 10ops-eqsin: rack/setup/install bast5001 - https://phabricator.wikimedia.org/T181554#3793949 (10RobH) [19:15:20] 10Operations, 10ops-eqsin: rack/setup/install dns500[12] - https://phabricator.wikimedia.org/T181556#3793986 (10RobH) [19:15:44] 10Operations, 10ops-eqsin: rack/setup/install dns500[12] - https://phabricator.wikimedia.org/T181556#3794006 (10RobH) [19:16:52] PROBLEM - trendingedits endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[19:17:01] PROBLEM - eventstreams on scb2001 is CRITICAL: connect to address 10.192.32.132 and port 8092: Connection refused [19:17:21] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 504 (expecting: 200) [19:17:52] RECOVERY - trendingedits endpoints health on scb2001 is OK: All endpoints are healthy [19:18:01] RECOVERY - eventstreams on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.169 second response time [19:18:17] 10Operations, 10ops-eqsin: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557#3794022 (10RobH) [19:18:21] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [19:18:22] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [19:18:31] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve the selected anniversaries for January 15 [19:18:31] a response was received: /{domain}/v1/page/featured/{yyyy}/{mm}/{dd} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [19:19:21] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [19:19:22] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [19:19:31] mutante: Hey, want to help us freak out about an ORES service meltdown? [19:19:50] we have unstaged changed in tin, [19:19:57] I want to rebase but it doesn't let me [19:20:00] FYI [19:20:11] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-se [19:20:11] out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page [19:20:11] mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles [19:20:21] akosiaris: mutante: We were dealing with T181538 and failed over to codfw. Now codfw is gone. 
[19:20:23] T181538: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538 [19:20:31] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [19:20:50] (03CR) 10Greg Grossmeier: [C: 031] "+1 to the goal." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393827 (owner: 10Chad) [19:21:02] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [19:21:31] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [19:21:43] Can't even ssh to scb2001 [19:22:02] PROBLEM - ores uWSGI web app on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:22:09] It's end of the times [19:22:23] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=miscvar-status_type=5 [19:22:48] What the hell? [19:22:52] Well. I hope we don’t melt down the servers during the Fundraiser. [19:22:53] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1051 - https://phabricator.wikimedia.org/T181345#3794049 (10Marostegui) 05Open>03Resolved a:03Cmjohnson All good - thanks Chris! ``` root@db1051:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name... [19:22:59] eh, hi, whats' up [19:23:19] do you need to switch ores back from codfw? ehmm.. [19:23:28] mutante: Hi—sorry to throw you into this, but we had to fail over ORES a few minutes ago, and now we’re looking at further unknown badness. [19:23:42] mutante: I don’t know if that’s the fix. Feel free to do it if you think so. [19:23:51] awight: that's some sort of not bad, "Look, we are so under-resourced that we are down all the time, please help" [19:24:04] *ahem* [19:24:14] Amir1 is correct :D [19:25:00] awight: well.. all i know is what it says on that change the reason was [19:25:18] mutante: Do you see anything systemic wrong with CODFW? [19:25:30] ssh scb2001.eqiad.wmnet fails entirely [19:25:45] woops [19:25:47] I'm dumb [19:25:48] codfw [19:26:11] so, per "scb1001, scb1002 are having memory issues and we are afraid scb1003, scb1004 won't survive the new amount of req/s we started receiving today" [19:26:30] we could only change back to that but ^ [19:28:12] Looks like all of codfw went down at the same time. [19:28:17] I suspect redis node. [19:28:27] so in codfw these are on scb ? [19:28:32] but in eqiad they are dedicated? [19:28:41] the redis node? [19:28:54] are we talking about the oresrdb2* then?? [19:29:14] 10Operations, 10ops-codfw, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work): HP RAID Battery issue on elastic2004 - https://phabricator.wikimedia.org/T181412#3794065 (10Papaul) @Gehel Tracking information shows 10:30am CT as delivering time and it is almost 2pm. I contact UPS they let me k... [19:29:26] what does "unknown badness" currently mean ? [19:29:46] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T181266#3794066 (10Papaul) @Marostegui Tracking information shows 10:30am CT as delivering time and it is almost 2pm. I contact UPS they let me know that due to the pass holidays the package will not be delivered un... 
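Given the "I suspect redis node" hypothesis above, an obvious first check is whether the ORES Redis host in codfw answers at all and how loaded it is. A minimal sketch with redis-py; the port and the choice of INFO fields are assumptions, and it would have to run from a host allowed to reach oresrdb2001:

```
# Sketch: basic liveness/load probe of the suspected Redis node.
# Port 6379 is an assumption (ORES runs separate cache and queue instances).
import redis

r = redis.StrictRedis(host="oresrdb2001.codfw.wmnet", port=6379,
                      socket_connect_timeout=2, socket_timeout=2)
try:
    print("PING ->", r.ping())
    info = r.info()
    for key in ("connected_clients", "blocked_clients",
                "instantaneous_ops_per_sec", "used_memory_human"):
        print(key, "=", info.get(key))
except redis.RedisError as exc:
    print("Redis check failed:", exc)
```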
[19:29:52] fwiw, i can ssh to oresrdb2001 fine and it's not busy at all [19:30:04] mutante, ORES is currently not serving any scoring requests. The machines themselves look OK [19:30:13] 10Operations, 10ops-eqsin, 10netops: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#3794067 (10RobH) [19:31:00] 10Operations, 10ops-eqsin: rack/setup/install bast5001 - https://phabricator.wikimedia.org/T181554#3794082 (10RobH) [19:31:15] Amir1: Tossed my change. FYI in the future, best course of action is to `git stash` it, merge your thing, then re-apply it from `git stash` [19:31:21] 10Operations, 10ops-eqsin, 10netops: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#3794067 (10RobH) [19:31:23] 10Operations, 10ops-eqsin: rack/setup/install bast5001 - https://phabricator.wikimedia.org/T181554#3793949 (10RobH) [19:31:46] no_justification: just wanted to be sure nothing is wrong in prod [19:31:49] 10Operations, 10ops-codfw, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work): HP RAID Battery issue on elastic2004 - https://phabricator.wikimedia.org/T181412#3794090 (10Gehel) @Papaul you don't necessarily need me around for that. The only things I would do: * depool server * schedule dow... [19:31:56] thanks, I do that the next time [19:32:01] Unless someone came behind me and broke stuff :p [19:32:02] halfak: is that "outage"-level-bad? it seems going back involves the risk that Alex put on the message. should i escalate it to more people? [19:32:09] scb2001 responds to scoring requests from localhost [19:32:43] So does scb2002 [19:32:58] and scb2003. [19:33:03] 10Operations, 10ops-eqsin, 10netops: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#3794094 (10RobH) [19:33:10] sorry mutante. I'm not sure. [19:33:14] I can describe the situation. [19:33:25] I'm not sure I'm familiar with the norms of escalation in this case. [19:33:56] akosiaris: should we revert the switch of ORES to codfw? [19:34:30] halfak: The overload event seems to have ended [19:34:45] it did [19:34:50] halfak: did it ever work from codfw and then stop or did it just never work? you say it works from localhost but not external? [19:34:52] I'm noticing that too [19:35:04] mutante, worked for a couple hours. [19:35:08] (03CR) 10Thcipriani: [C: 031] Add closed wikis to group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393827 (owner: 10Chad) [19:35:21] Had 7 minutes of downtime. [19:35:27] Now back at 100% capacity. [19:35:34] ah [19:36:25] What happened? [19:36:48] (03PS3) 10Chad: Add closed wikis to group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393827 [19:36:53] 10Operations, 10ops-codfw, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work): HP RAID Battery issue on elastic2004 - https://phabricator.wikimedia.org/T181412#3794113 (10Papaul) @Gehel UPS just give me wrong timing> I just got email confirmation that i got the package on site. can you go... [19:36:54] And why did it happen to 6 machines all at once? [19:37:25] (03PS4) 10Chad: Add closed wikis to group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393827 [19:38:30] nostalgiawiki will soon be on the bleeding edge again [19:38:45] (03CR) 10Thcipriani: [C: 031] scap prep: Don't copy the patch dir if it already exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393815 (owner: 10Chad) [19:39:57] which hostnames exactly are the 6, ores201-2006 ? 
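The "responds to scoring requests from localhost" checks above amount to issuing a single score request directly against a backend and comparing it with the same request through the public entry point. A minimal sketch with the `requests` library, assuming the v3 scores endpoint, an arbitrary example revision ID, and port 8081 for the local uwsgi listener:

```
# Sketch: one scoring request, the same shape of traffic the graphs above count.
# Endpoint form (v3), revision ID, and local port are assumptions.
import requests

public = requests.get(
    "https://ores.wikimedia.org/v3/scores/enwiki/",
    params={"models": "damaging", "revids": "812345678"},
    timeout=10,
)
print(public.status_code, public.elapsed.total_seconds(), "seconds")

# On an scb host itself, the same request against the local listener
# (port assumed) bypasses LVS and varnish entirely:
# requests.get("http://localhost:8081/v3/scores/enwiki/",
#              params={"models": "damaging", "revids": "812345678"}, timeout=10)
```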
[19:40:08] because there are some more, but not installed or so [19:40:47] mutante: No, these are scb200* [19:40:49] the memory/CPI all looks really low [19:41:01] ores* is a non-production cluster that we haven’t pooled yet. [19:41:15] (s/a cluster/two clusters/) [19:41:31] ok, so there are only 2 of them in codfw then [19:41:43] and 2 in eqiad [19:42:38] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=oresrdb2001.codfw.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+codfw [19:42:49] there you can see it, network just stops [19:44:01] and on the other server https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=oresrdb2002.codfw.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+codfw [19:46:43] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T181266#3794136 (10Marostegui) No problem at all! Thanks for the heads up! [19:48:47] mutante, suggests that oresrdb2* lost network? [19:49:58] halfak: maybe, but i cant see anything obvious on oresrdb2001 syslog at that time [19:50:29] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T181266#3794148 (10Papaul) a:05Papaul>03Marostegui @Marostegui it looks like UPS give me wrong timing i got the park. Disk replacement complete [19:51:26] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3794152 (10Halfak) All nodes in CODFW just went down at the same time. For a short period. See * [Overload errors](https://grafana.wikimedia.org... [19:51:50] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3794155 (10Halfak) My hypothesis is that our redis nodes had a network blip and that caused all scoring requests to back up for a period. [19:52:17] mutante, ganglia login is wikitech credentials? [19:52:20] hmm, oresrdb2001 is missing in racktables, wanted to check if same rack [19:52:49] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: Investigate redis-cluster or other techniques for making Redis not a single point of failure. - https://phabricator.wikimedia.org/T181559#3794157 (10awight) [19:52:50] halfak: yea, i am in the middle of deleting it but it still works for this [19:53:08] kk thanks. Got it. fat fingered the password [19:53:34] Right... yeah, I see the network now. [19:54:02] * halfak sets alter status to orange [19:54:19] mutante, I'm downgrading my concerns -- hoping this was just a network blip. [19:54:27] Will keep monitoring. Thanks for your help. [19:54:55] halfak: i also opted for not touching the change in this case. and i want to figure out if both servers are in the same physical rack / switch [19:55:51] would love to know that too. My stomach is going to be tight until I know what the heck that was. :| [19:56:12] * halfak imagines how many sleepless nights were lost over someone tripping on a cable. [19:57:32] in eqiad they are separate racks says racktables [19:57:43] What about codfw [19:58:08] mutante: Should we make a task to request they live together? [19:59:41] we'd want them to not live together so that it's less likely both have an issue at once, afaict [19:59:59] halfak: turns out the reason 2002 isnt in racktables.. is that it's virtual. VM :p [20:00:04] no_justification: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171128T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:00:23] so the difference is oresrdb2002 is on ganeti [20:00:52] mutante, lolwat [20:01:32] (03PS1) 10Urbanecm: Add NS aliases for zh_wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393835 (https://phabricator.wikimedia.org/T181374) [20:03:10] halfak: both oresrdb in codfw are virtual machines. i got confused because there must have been an physical "oresrdb2001" as well in the past. racktables is just about physical machines [20:03:45] Oh gotcha. Could it be that the machine they are hosted on had an issue? Are there other vms we could check in ganglia? [20:04:20] Is that… safe? [20:07:30] i checked 2 other VMs on the same ganeti cluster but dont see the gap like that in networking graph [20:07:52] among them for example mx2001 [20:08:59] 10Operations, 10Discovery, 10Traffic, 10Wikimedia-Apache-configuration, and 4 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#3794209 (10Mholloway) Will do. Thanks, @JoeWalsh & @Fjalapeno. [20:09:52] halfak: correction.. they ARE actually mixed, one VM and one physical, as Robh correctly points out [20:10:38] but this makes networking issue less likely since they both showed it [20:11:52] (03CR) 10Chad: [C: 032] Add closed wikis to group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393827 (owner: 10Chad) [20:12:00] (03PS3) 10Chad: scap prep: Don't copy the patch dir if it already exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393815 [20:13:32] "1514 [510] 28 Nov 19:17:31.012 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis. [20:13:41] ^ this happened shortly before the outage [20:14:36] (03Merged) 10jenkins-bot: Add closed wikis to group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393827 (owner: 10Chad) [20:14:52] only shows up on 2001, not on 2002 [20:16:11] !log demon@tin Synchronized dblists/group0.dblist: adding some new wikis (duration: 00m 48s) [20:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:27] (03CR) 10Chad: [C: 032] scap prep: Don't copy the patch dir if it already exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393815 (owner: 10Chad) [20:17:14] (03CR) 10jenkins-bot: Add closed wikis to group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393827 (owner: 10Chad) [20:17:54] Looks like we're having an event again. We see a cascade of requests for test_stats from MW and then our web workers start to lock up (maybe) [20:18:37] (03Merged) 10jenkins-bot: scap prep: Don't copy the patch dir if it already exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393815 (owner: 10Chad) [20:19:44] !log demon@tin Synchronized scap/plugins/prep.py: no-op (duration: 00m 48s) [20:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:35] (03CR) 10jenkins-bot: scap prep: Don't copy the patch dir if it already exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393815 (owner: 10Chad) [20:21:13] And here we go again. 
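The Redis warning quoted above ("Asynchronous AOF fsync is taking too long (disk is busy?)") points at append-only-file persistence. Redis exposes the relevant state in the persistence section of INFO, which can be read without touching any data. A minimal sketch; host and port are assumptions:

```
# Sketch: read AOF/RDB persistence state from the ORES Redis queue host.
# Host and port are assumptions; the field names are standard Redis INFO output.
import redis

r = redis.StrictRedis(host="oresrdb2001.codfw.wmnet", port=6379)
persistence = r.info("persistence")
for key in ("aof_enabled", "aof_rewrite_in_progress",
            "aof_last_bgrewrite_status", "aof_last_write_status",
            "aof_last_rewrite_time_sec", "rdb_last_bgsave_status"):
    print(key, "=", persistence.get(key))
```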
[20:21:22] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3794219 (10awight) @Dzahn found a clue to the latest *codfw* outage, in which oresrdb Redis network traffic spikes and then crashes to zero: ``` [3... [20:21:43] ORES is down. Looks like we're on an hourly period [20:21:58] At 19:13, overload errors start [20:22:22] Everything calms down by 19:27 [20:22:41] Then at 20:16 it starts up again [20:23:53] (03PS1) 10Dzahn: Revert "Failover ORES to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/393839 [20:24:03] "Asynchronous AOF fsync is taking too long (disk is busy?). " [20:24:04] ^ mutante yes, that. Thank you. [20:24:54] mutante: Please also check that the LB weights favor scb100[3-4] over 100[1-2]. [20:25:10] akosiaris set that a few hours ago, just want to confirm it’s still configured that way. [20:25:28] (03CR) 10Dzahn: [C: 032] "reverting this because we kept having outages on codfw and on oresrdb2001 (the virtual machine) we had "Asynchronous AOF fsync is taking t" [puppet] - 10https://gerrit.wikimedia.org/r/393839 (owner: 10Dzahn) [20:25:58] (03PS2) 10Dzahn: Revert "Failover ORES to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/393839 [20:28:57] !log forcing puppet run on cache misc to revert "failover ORES to codfw" [20:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:12] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200* - https://phabricator.wikimedia.org/T181563#3794273 (10Halfak) [20:30:13] awight: checked. it's here https://config-master.wikimedia.org/pybal/eqiad/ores [20:30:21] ty! [20:31:31] it ran puppet on all cache servers, it should be switched [20:32:08] that's eqiad and codfw as before, not "eqiad-only" [20:35:07] mutante: BTW we noticed that when scb1001-2 went down, we didn’t get codfw picking up any of the traffic. Is there anything special we’d have to do to let the LB know about machine failures? [20:35:51] Traffic is back on eqiad. [20:35:57] Getting a lot of requests to scb1001 [20:36:25] (03PS2) 10Chad: group0 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393817 [20:36:50] awight: not that i'm aware of. it _should_ use both if both are active like that but best to check that with Traffic team [20:36:53] https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=now-2h&to=now-1m&panelId=10&fullscreen [20:37:53] well, let's hope that it lasts (re: "memory issues" that have been mentioned).. knocks on wood [20:38:10] mutante: Thanks for all the help! [20:38:13] 10Operations, 10ops-codfw, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work): HP RAID Battery issue on elastic2004 - https://phabricator.wikimedia.org/T181412#3794304 (10Papaul) a:05Papaul>03Gehel Battery replacement complete Firmware update complete [20:38:25] yes. 
Thank you mutante :) [20:39:04] you're welcome [20:40:06] (03CR) 10Chad: [C: 032] group0 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393817 (owner: 10Chad) [20:41:34] (03Merged) 10jenkins-bot: group0 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393817 (owner: 10Chad) [20:41:48] (03CR) 10jenkins-bot: group0 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393817 (owner: 10Chad) [20:42:27] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.10 [20:42:32] !log repooling elastic2004 after RAID controller maintenance - T181412 [20:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:40] T181412: HP RAID Battery issue on elastic2004 - https://phabricator.wikimedia.org/T181412 [20:45:16] (03PS1) 10Andrew Bogott: nova-network dnsmasq: set a deployment-appropriate cname for 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393841 (https://phabricator.wikimedia.org/T181375) [20:45:17] (03PS1) 10Andrew Bogott: labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393842 (https://phabricator.wikimedia.org/T181375) [20:45:49] (03CR) 10jerkins-bot: [V: 04-1] nova-network dnsmasq: set a deployment-appropriate cname for 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393841 (https://phabricator.wikimedia.org/T181375) (owner: 10Andrew Bogott) [20:47:42] (03PS2) 10Andrew Bogott: labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393842 (https://phabricator.wikimedia.org/T181375) [20:52:10] (03PS2) 10Andrew Bogott: nova-network dnsmasq: set a deployment-appropriate cname for 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393841 (https://phabricator.wikimedia.org/T181375) [20:52:12] (03PS3) 10Andrew Bogott: labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393842 (https://phabricator.wikimedia.org/T181375) [20:52:34] (03CR) 10jerkins-bot: [V: 04-1] nova-network dnsmasq: set a deployment-appropriate cname for 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393841 (https://phabricator.wikimedia.org/T181375) (owner: 10Andrew Bogott) [20:55:28] (03PS1) 10Chad: Remove aawiki from group0, it's weird [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393844 [20:55:30] (03CR) 10Chad: [C: 032] Remove aawiki from group0, it's weird [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393844 (owner: 10Chad) [20:55:32] (03PS1) 10Framawiki: Set $wgNamespaceRobotPolicies for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393845 (https://phabricator.wikimedia.org/T181525) [20:55:34] (03PS3) 10Andrew Bogott: nova-network dnsmasq: set a deployment-appropriate cname for 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393841 (https://phabricator.wikimedia.org/T181375) [20:55:36] (03PS4) 10Andrew Bogott: labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393842 (https://phabricator.wikimedia.org/T181375) [20:56:04] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: removing aawiki from group0 [20:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:49] (03Merged) 10jenkins-bot: Remove aawiki from group0, it's weird [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393844 (owner: 10Chad) 
[20:57:00] (03CR) 10jenkins-bot: Remove aawiki from group0, it's weird [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393844 (owner: 10Chad) [20:57:31] (03CR) 10Rush: labsaliaser: handle requests for the simple hostname 'puppet' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393842 (https://phabricator.wikimedia.org/T181375) (owner: 10Andrew Bogott) [20:58:54] (03PS2) 10Framawiki: Set $wgNamespaceRobotPolicies for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393845 (https://phabricator.wikimedia.org/T181525) [20:59:05] (03PS5) 10Andrew Bogott: labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393842 (https://phabricator.wikimedia.org/T181375) [21:08:38] ccccccevibtjbjfkchkdntlrcitvljfutefcuglrvjdl [21:08:57] damn... I did not even touch it [21:09:31] akosiaris: It's trying to log you in somewhere in the background and has a bug. You should let one of the TLAs know. [21:11:47] greg-g: it also has the damn weak RNG vulnerability from a few weeks ago [21:12:03] I was just having it plugged thiking if I can reuse it in some non vulnerable way [21:12:10] apparently not :) [21:12:12] now I am just thinking about burning it [21:12:59] mutante: so we failed ORES again from codfw [21:13:05] 10Operations, 10User-Joe: [DRAFT][RfC] Deployment of python applications in production - https://phabricator.wikimedia.org/T180023#3744120 (10Pnorman) > You mean you want to download the main application from pypi too? I don't really see that being very useful - I expect us to have to do tailor-made changes to... [21:13:08] sigh... what's going on... [21:15:16] akosiaris: Did you see this note? https://phabricator.wikimedia.org/T181538#3794219 [21:15:25] * akosiaris looking [21:17:45] it does seem to be rewritting the AOF very often [21:18:14] (03PS1) 10Rush: bootstrapvz: add nbd-client package [puppet] - 10https://gerrit.wikimedia.org/r/393853 [21:19:30] disk has a spiky utilization of around 15%.. which is not much [21:19:39] we are talking spikes of 15MB/s [21:19:49] (03CR) 10Smalyshev: Create script for automatic reload of categories (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/392736 (https://phabricator.wikimedia.org/T173772) (owner: 10Smalyshev) [21:23:54] !log Manually killed all remaining Wikidata JSON dumpers on snapshot1007. Some shards failed due to the db1110 depool. [21:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:18] akosiaris: yes, when it was just on codw we kept having outages and i saw those fsync messages at the same time [21:24:33] (03PS3) 10Smalyshev: Create script for automatic reload of categories [puppet] - 10https://gerrit.wikimedia.org/r/392736 (https://phabricator.wikimedia.org/T173772) [21:24:49] !log Manually killed all remaining Wikidata TTL (RDF) dumpers on snapshot1007. Some shards failed due to the db1110 depool. 
[21:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:10] only on the 2001, not on 2002 [21:26:18] yeah that's expected [21:26:23] 2002 is a failover passive host [21:26:31] but that sync message on 2001 seems to be an issue [21:26:33] addshore: I'm done with my part of train if you have any Wikidata stuff to do :) [21:30:38] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200* - https://phabricator.wikimedia.org/T181563#3794457 (10Halfak) ``` [15:27:23] I was looking at https://github.com/antirez/redis/issues/1019 but it has been "fixed" s... [21:30:38] no_justification: nothing from me! [21:33:04] akosiaris: so other services going berserk on scb200* is ores' doing? [21:33:14] there seem to be a lot of that [21:33:34] mobrovac: not sure, I am actually trying to fix the problem in eqiad right now which is serving the bulk of the traffic [21:33:48] I 'll have a look later on in codfw [21:39:07] mobrovac, we don't know if this is ORES doing at all. [21:39:17] But we know that ORES is struggling. [21:39:54] things seem to have stabilised in codfw though [21:40:43] Right. We're still working on the cause. [21:40:46] You could help :) [21:41:44] !log disable ORES queue redis persistency by config set appendonly no on oresrdb1001 [21:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:04] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#3650187 (10ema) The approach we have in mind can roughly be summed up with `varnishncsa | mtail`. We can group the six scrip... [21:42:40] something happened at 20:23 there, i see both mobileapps and cxserver workers massively dying and restarting [21:44:36] mobrovac, I agree that something happened. I don't have evidence for ORES being at the cause "despite looking for it" but I do have evidence that ORES was affected by it. [21:44:44] It seemed to happen with an hourly cadence. [21:44:56] hm [21:44:56] There was a CPU spike at 19:15 too [21:45:39] Oh a bit of evidence that it is related to ORES. These spikes only happened when ORES was serving outside traffic there. [21:45:48] ORES is always serving precaching requests there. [21:46:50] OOM killer did not kill any nodejs services on scb* boxes [21:46:54] I can say that for sure [21:47:10] it did come out though and did kill celery processes [21:47:37] so if nodejs processes were starved for memory and as a result they committed harakiri [21:47:54] which makes sense, I would expect it being an unhandled exception [21:48:08] that would explain the issue [21:49:12] mobrovac: would Nov 28 19:32:32 scb2001 mobileapps[26224]: (node:2216) Warning: Possible EventEmitter memory leak detected. 11 stats listeners added. Use emitter.setMaxListeners() to increase limit [21:49:16] explain anything ?
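Following up on the OOM-killer observations above (celery processes killed, the nodejs services not), kernel OOM kills can be tallied per process name straight from syslog on the affected scb hosts. A minimal sketch; the log path and the exact kernel message format are assumptions and vary by distribution:

```
# Sketch: count oom-killer victims by process name from syslog.
# Log path and exact kernel message format are assumptions.
import re
from collections import Counter

pattern = re.compile(r"Out of memory: Kill process \d+ \((?P<name>[^)]+)\)")
victims = Counter()

with open("/var/log/syslog", errors="replace") as fh:
    for line in fh:
        match = pattern.search(line)
        if match:
            victims[match.group("name")] += 1

for name, count in victims.most_common():
    print(count, name)
```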
[21:49:29] I am guessing no, just making sure [21:49:43] nah [21:49:48] that's not it [21:50:00] https://9to5mac.com/2017/11/28/how-to-set-root-password/ [21:50:01] uh [21:50:02] systemctl btw does say mobileapps is up since 6 days on scb1001 [21:50:02] woops [21:50:07] that was meant for someone else [21:51:22] on all of scb actually [21:51:29] and for cxserver it's 3 days [21:51:30] yes akosiaris, the workers were killed by the master process because they stopped sending heartbeats, which usually means they needed cpu or ram and couldn't get it (or were caught in a cpu loop) [21:51:32] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&var-instance=All&from=1511289213149&to=1511894013149 [21:51:41] ^ shows weird memory activity for the last week [21:52:40] Starting 11/23 [21:53:58] mobrovac: yeah https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?from=now-6h&to=now&orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=scb&var-instance=All does point out some increased CPU usage and of course memory usage [21:54:29] so if heartbeats were not send due to lack of any of that 2, yeah it's easy to explain [21:54:44] yeah [21:55:14] what's the timelines you got ? [21:55:24] and it seems to coincide with the times in quesiton - 19:1x and 20:2x [21:55:35] ~19:15 ? [21:55:41] yeah that is [21:55:42] yeah [21:55:50] ok we explained that at least [21:56:06] Any way we could find out what used all of that CPU? [21:56:16] the spikes in rx and tx are also weird [21:56:31] I'd like to try to confirm. If awight's hypothesis is right, we should see uwsgi dominating it. [21:56:56] no we don't have the information [21:57:45] but the spike is clearly user cpu [21:57:45] BTW, ORES 100% down now and has been for ~1 hour. [21:57:50] so it's an application [21:57:56] Looks like the persistence switch didn't affect much [21:58:40] I am willing to bet it was celery that was to blame on 19:15 [21:58:47] anyway back to the problem at hand [21:58:53] We have a *huge* amount of web workers running right now. [21:59:00] SO something is hammering ORES it seems. [21:59:40] halfak: Peek at the logs… [21:59:40] redis hasn't performed an AOF sync since 21:39 [21:59:44] All of our overloads correspond to a huge increase in active web workers which could either be celery getting jammed or a huge increase in requests. [21:59:54] awight, can we quantify those logs somehow? [22:00:02] yes one moment [22:00:39] btw, ores' main.log file is not readable by world, which makes things harder to debug [22:00:44] halfak: https://logstash.wikimedia.org/goto/17933ea1ee9341f249a63c0d76723eed [22:00:49] app.log is though [22:00:54] It’s… all too possible that we’re DoS’ing ourselves. 
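The `config set appendonly no` action in the !log above disables AOF persistence for the ORES Celery queue at runtime, trading queue durability across a Redis restart for never blocking on fsync. A minimal redis-py equivalent of that one-liner; host and port are assumptions, and the change is not written back to redis.conf:

```
# Sketch: runtime equivalent of the logged `config set appendonly no`.
# Host and port are assumptions; the change is not persisted to redis.conf.
import redis

r = redis.StrictRedis(host="oresrdb1001.eqiad.wmnet", port=6379)
print("before:", r.config_get("appendonly"))
r.config_set("appendonly", "no")   # stop appending/fsyncing the AOF
print("after:", r.config_get("appendonly"))
```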
[22:01:00] right [22:01:12] Not sure why it let us sleep last night before blowing up thought [22:01:15] *though [22:02:07] awight, https://logstash.wikimedia.org/goto/fa2c27189f05982aba1b27e3a57bd802 [22:02:10] That one is more telling :) [22:02:32] /o\ [22:02:37] Somehow this all started at 1910 [22:04:10] Confirmed that there's a very strong correlation between the test_stats graph and our overloads on ORES [22:04:29] halfak: well what we can’t show yet is causality [22:04:32] Both at the 19:1* and 20:2* hour [22:04:35] Right [22:04:42] We know that this is going to happen *after* an overload [22:05:27] (03CR) 10Dzahn: [C: 032] remove Ganglia from cache::canary, cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/393674 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [22:05:35] (03PS6) 10Dzahn: remove Ganglia from cache::canary, cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/393674 (https://phabricator.wikimedia.org/T177225) [22:07:21] what's the "DoS-ing ourselves" theory look llike? [22:08:21] btw, OOM showed up on scb1003 kiling celery for the first time a 10 mins ago. on 21:55 [22:08:52] bblack, will have a graphic for you shortly. [22:09:25] awight, do you have a task for the test stats issue? [22:09:36] https://phabricator.wikimedia.org/T181534 [22:09:38] Got it [22:09:44] halfak: T181567 [22:09:45] T181567: Rate limit thresholds requests when the service is down - https://phabricator.wikimedia.org/T181567 [22:09:58] Oh [22:10:12] oops—deduped. [22:10:51] [2017-11-28T22:10:38] Exception ignored in: > [22:10:52] [2017-11-28T22:10:38] Traceback (most recent call last): [22:10:52] [2017-11-28T22:10:38] File "/srv/deployment/ores/venv/lib/python3.4/site-packages/enchant/__init__.py", line 576, in __del__ [22:10:52] [2017-11-28T22:10:38] File "/srv/deployment/ores/venv/lib/python3.4/site-packages/enchant/__init__.py", line 638, in _free [22:10:52] [2017-11-28T22:10:38] File "/srv/deployment/ores/venv/lib/python3.4/site-packages/enchant/__init__.py", line 346, in _free_dict [22:10:53] [2017-11-28T22:10:38] File "/srv/deployment/ores/venv/lib/python3.4/site-packages/enchant/__init__.py", line 353, in _free_dict_data [22:10:54] [2017-11-28T22:10:38] TypeError: 'NoneType' object is not callable [22:10:55] bblack, https://phabricator.wikimedia.org/T181567#3794570 shows graphs for ORES down-ness next to massive requests for "test_stats" from MW [22:10:57] I also just saw that ^ [22:11:12] akosiaris, oh weird. No idea what that is. [22:11:17] over 100k info messages from ores on scb200* for the 10-min period surrounding 20:2x [22:11:45] mobrovac, yeah. That fits with the DOS-ing ourselves hypothesis. [22:11:48] what's test_stats? [22:11:49] See https://logstash.wikimedia.org/app/kibana#/dashboard/ORES?_g=h@13e1488&_a=h@b6cc111 [22:11:58] It's an old param that ORES doesn't use anymore. [22:12:08] We think the ORES Ext is hammering the service requesting it anyway. [22:12:20] ah [22:12:28] should we disable the extension for a while ? [22:12:37] that's the nuclear option ofc [22:12:38] can we return a different response to the Ext in that case to nip it? give them a fake 200 OK or whatever and drain it out quickly? [22:12:52] bblack, yeah. That'd be helpful [22:12:58] maybe they're retrying on the 400 they get for the bad param [22:13:00] Shut off that one URL pattern [22:13:02] No I don’t think that would be helpful. [22:13:09] Oh follow awight. [22:13:14] he knows more about ext behavior [22:13:15] Unless the fake response is correctly formed... 
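One way to quantify the test_stats hammering outside of Logstash is to bucket the service's request log by minute and by user agent. A minimal sketch; the log path is an assumption, and the line format (a leading `[YYYY-MM-DDTHH:MM:SS]` timestamp plus a trailing `user agent "..."`) follows the uwsgi line quoted near the end of this log:

```
# Sketch: tally model_info=test_stats requests per minute and per user agent.
# Log path and line format are assumptions based on the uwsgi lines in this log.
import re
from collections import Counter

LOG_PATH = "/srv/log/ores/app.log"                          # assumed location
ts_re = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})")   # minute bucket
ua_re = re.compile(r'user agent "([^"]*)"')

per_minute = Counter()
per_agent = Counter()
with open(LOG_PATH, errors="replace") as fh:
    for line in fh:
        if "model_info=test_stats" not in line:
            continue
        ts = ts_re.match(line)
        if ts:
            per_minute[ts.group(1)] += 1
        ua = ua_re.search(line)
        if ua:
            per_agent[ua.group(1)] += 1

print(per_minute.most_common(10))
print(per_agent.most_common(10))
```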
[22:13:24] awight, but it would stop hitting ORES [22:13:25] * awight scratches backside... [22:13:41] We could also just populate the caches directly if it’s like that [22:13:51] I have a patch ready for review, though [22:13:55] https://gerrit.wikimedia.org/r/#/c/393922/ [22:14:15] awight, +1 for populating the caches directly. [22:14:19] well, to go around populating the cache we need to know the exact query [22:14:33] do we ? [22:14:53] like ... is it a single revision ? [22:15:11] one cached set of "thresholds" per model [22:15:14] It's small and finite. [22:15:37] fine by me then [22:15:45] Give me 5 minutes to try this patch locally. Stuffing caches manually or using varnish is at least as complicated, and not sustainable IMO [22:15:50] What do you think, awight? [22:15:53] OK [22:15:55] agreed [22:16:29] My next Q is going to be "why didn't this happen right after we deployed the extension?" [22:16:32] oh I didn't mean fake the 200's via Varnish (although that's possible, too). I meant have the ORES service, instead of seeing test_stats and throwing a 400, notice the bad param and give a well-formed fake 200 OK response quickly. [22:16:56] Maybe it took this long to decide to refresh the thresholds. [22:17:04] bblack: yeah we are taling about the same thing. ORES has a local "cache" [22:17:11] nothing to do with varnish [22:17:27] In this case, ORES responds to this request really fast. [22:17:37] (03PS2) 10Dzahn: various misc roles: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/393708 (https://phabricator.wikimedia.org/T177225) [22:17:46] It does one hash map lookup and says "I don't have that" and returns the 400 [22:17:58] I'm not sure ORES is fast enough to ignore the rate of requests [22:18:24] Expected response time is ~50ms [22:18:33] To my laptop in MN [22:18:35] is the rate of these bad requests sane? or is it that the clients are re-spamming the same requests due to poor client-side handling of 400 as something retryable? [22:18:47] ^ that I think [22:18:55] rate of request are totally insane. [22:19:07] bblack: yes exactly. They’re stampeding, and not caching the fact that they got a failure. [22:19:14] right, so turn them into 200s just to shut them up [22:19:48] bblack: I am looking into https://grafana-admin.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&panelId=22&fullscreen&orgId=1&edit and I do see 60k reqs/min peak today in eqiad misc [22:19:51] Maybe? The response has to actually be appropriate to some degree. I like this plan though. [22:20:02] eqiad misc frontends* [22:20:18] awight, the rate limiting plan or the 200 FU Go away plan? [22:20:28] so we do have some quite elevated external req/s at varnish. I am guessing it's ORES ? [22:20:35] probably [22:20:53] that however kind of shoots down the "we are DOSing ourselves" theory [22:21:03] and we are back to "someone is killing us" theory [22:21:29] we might be exacerbating the problem with some bad responses, but it's not the root cause [22:21:40] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5&from=1511895528236&to=1511907647764 [22:21:57] ^ you can see there that over the problem period, for all of cache_misc, 400 is the most-popular response code, which fits all of this [22:22:12] akosiaris: The tresholds requests would be coming from the mediawiki app servers. 
Right, the second likely possibility is that we’re following up on service collapse with a bunch of stupid ORES API requests that are bound to fail. [22:22:54] bblack: We return pure 400, yes. [22:23:01] bblack: ok that makes sense indeed [22:24:01] This won't count MW-based requests? [22:24:18] no, this is at the "border" [22:24:40] unless mw does something stupid and has "ores.wikimedia.org" in its configuration [22:24:47] instead of ores.svc.eqiad.wmnet [22:24:48] Any was we could find out the top user-agents that are making these requests? [22:24:50] * akosiaris looking [22:24:55] akosiaris, :) [22:25:00] akosiaris: Ummmm.... [22:25:04] [we've noted before that several internal services make internal requests through public hostnames, FTR, but donno about ORES] [22:25:06] The 400 requests are all MediaWiki, fwiw [22:25:09] * halfak bets on MW having ores.wikimedia.org in its config [22:25:14] +1 ^ [22:25:18] 3418: $wgOresBaseUrl = 'https://ores.wikimedia.org/'; [22:25:25] wmf-config/CommonSettings.php: $wgOresBaseUrl = 'https://ores.wikimedia.org/'; [22:25:30] so yeah... [22:25:31] yeah it shouldn't [22:25:34] This is all coming together. [22:25:36] (I just came across that the other day for unrelated reasons) [22:25:47] (03PS1) 10Hoo man: Fix killing dumpers in Wikidata entity dumpers [puppet] - 10https://gerrit.wikimedia.org/r/393923 [22:25:55] on the other hand, if ores responses are cacheable at all, this might be saving you from some other type of meltdown, I donno [22:26:04] I think they are not at all [22:26:05] in general it would be better to direct internal requests at internal service URIs [22:26:11] we 've had that discussion in the past [22:26:24] I just don't know whether in this moment it would make things worse to fix it [22:26:29] I 'll upload a patch for mediawiki-config [22:26:31] (03CR) 10Hoo man: "Note: I manually tested this." [puppet] - 10https://gerrit.wikimedia.org/r/393923 (owner: 10Hoo man) [22:26:41] (03CR) 10Dzahn: [C: 032] various misc roles: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/393708 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [22:26:45] bblack: I am pretty sure we don't cache much on ORES in varnish [22:26:49] ok [22:26:58] well, it's certainly not caching the 400s :) [22:27:06] maybe some static assets for the swagger specs and the basics for /vX/scores [22:27:11] so very basic stuff [22:27:39] 10Operations, 10ops-eqsin, 10netops: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#3794607 (10RobH) [22:27:56] bblack: I am changing it to ores.discovery.wmnet [22:28:01] akosiaris, we don't allow that to be cached now, but we could if we wanted to get a bit more nuanced. 
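On the stampede point above (MediaWiki re-requesting thresholds and "not caching the fact that they got a failure"), the general remedy is to cache the thresholds lookup per model and to negative-cache failures for a short time. The real change belongs in the PHP ORES extension (the Gerrit patch linked above); the following is only a language-shifted sketch of the idea, not that patch:

```
# Sketch of the idea only: memoize per-model threshold lookups, and cache
# failures briefly so an outage cannot trigger a request stampede.
import time

TTL_OK = 3600      # keep good answers for an hour (value is an assumption)
TTL_FAIL = 60      # back off for a minute after any failure
_cache = {}        # model name -> (expires_at, value_or_None)

def get_thresholds(model, fetch):
    """fetch(model) does the real HTTP request and may raise on error."""
    now = time.time()
    cached = _cache.get(model)
    if cached and cached[0] > now:
        return cached[1]
    try:
        value = fetch(model)
        _cache[model] = (now + TTL_OK, value)
    except Exception:
        _cache[model] = (now + TTL_FAIL, None)   # negative cache the failure
        value = None
    return value
```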
[22:28:51] 10Operations, 10ops-eqsin, 10DC-Ops: singapore caching center: eqiad staging tracking task - https://phabricator.wikimedia.org/T166179#3287561 (10RobH) [22:28:53] 10Operations, 10ops-eqsin, 10DC-Ops: singapore caching center: eqiad staging tracking task - https://phabricator.wikimedia.org/T166179#3287561 (10RobH) [22:29:55] (03PS1) 10Alexandros Kosiaris: ORES: Use the internal discovery URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393924 (https://phabricator.wikimedia.org/T181538) [22:30:14] I am going ahead and deploying https://gerrit.wikimedia.org/r/#/c/393924/ [22:30:22] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ORES: Use the internal discovery URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393924 (https://phabricator.wikimedia.org/T181538) (owner: 10Alexandros Kosiaris) [22:30:35] (03CR) 10jenkins-bot: ORES: Use the internal discovery URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393924 (https://phabricator.wikimedia.org/T181538) (owner: 10Alexandros Kosiaris) [22:31:13] awight, I had a look at https://gerrit.wikimedia.org/r/#/c/393922/1 but it's hard for me to give it a good review. [22:31:28] (03PS2) 10Dzahn: remove Ganglia from cache::misc [puppet] - 10https://gerrit.wikimedia.org/r/393675 (https://phabricator.wikimedia.org/T177225) [22:31:34] !log deploy wmf-config/CommonSettings.php for ORES internal discovery URL, https://gerrit.wikimedia.org/r/#/c/393924/ T181538 [22:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:42] T181538: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538 [22:31:43] !log akosiaris@tin Synchronized wmf-config/CommonSettings.php: (no justification provided) (duration: 00m 49s) [22:31:47] * akosiaris should remember next time to use scap's niceties [22:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:57] ok that's done [22:32:18] mediawiki should start using the internal URLs now. We should see varnish reqs dropping [22:32:27] halfak: Thanks—I’m hoping Krinkle has a minute to check that out. [22:35:36] (03CR) 10Dzahn: [C: 032] remove Ganglia from cache::misc [puppet] - 10https://gerrit.wikimedia.org/r/393675 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [22:36:23] halfak: so about your UA question. The top 4 GETs since this morning are: [22:36:28] 382808 user agent "MediaWiki/1.31.0-wmf.8" [22:36:28] 16278 user agent "Twisted PageGetter" [22:36:28] 16276 user agent "Mozilla/5.0" [22:36:28] 2519 user agent "check_http/v2.1.1 (monitoring-plugins 2.1.1)" [22:36:29] 2387 user agent "dashboard.wikiedu.org production" [22:36:31] well 5 not 4 [22:36:40] ignore Twisted (that's pybal) [22:36:45] and check_http, that's icinga [22:36:51] Thanks. Was guessing it was MediaWiki [22:36:55] so.. mediawiki is causing all our issues [22:37:21] but why does it matter that it went through varnish? [22:37:30] I mean I get that this is better in many senses [22:37:43] but whatever was getting broken is still going to get broken [22:37:46] yes yes [22:37:52] <3 Krinkle [22:37:58] it will now break more-efficiently, though :) [22:38:09] lol [22:38:12] it only mentally disallowed us for a bit to figure out the culprit [22:38:22] we should have looked at UAs way way sooner [22:39:10] 10Operations, 10ops-eqsin: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3794643 (10RobH) [22:39:11] * akosiaris makes a mental note for dashboard.wikiedu.org production. 
I am wondering what they are doing [22:42:22] there is an ORES tab on https://dashboard.wikiedu.org/campaigns/visiting_scholars_office_use_only/overview [22:42:32] akosiaris: https://dashboard.wikiedu.org/campaigns/visiting_scholars_office_use_only/ores_plot [22:42:45] dunno, but that seems the only reference there [22:42:47] * halfak blames wikiedu [22:42:48] lol [22:43:46] halfak: While the service is down, we might as well turn off the extension. [22:43:51] I 'll not even pretend I understand what that graph is about.. anyway [22:43:59] awight, actually, we're still processing requests. [22:44:04] ah! [22:44:09] The extension is still working. Just crappily! [22:44:11] And slowly [22:44:17] sorry, I heard “100% down” a minute ago [22:44:18] hehe, yea, the "structural completeness of visiting scholars" sounds weeird [22:44:20] kk [22:44:47] awight, right. Fluctuations [22:45:04] Also it seems we're still serving precached and cached scores so I may have spoke too strongly. [22:45:13] akosiaris, if you ever want to talk fun ORES stuff that people do, let me know :D [22:45:32] halfak: you 're on. All hands :-) [22:45:37] :D [22:45:54] Speaking of all hands, I've been getting some Gems for dramatic reading in this channel recently. [22:46:09] lol [22:47:48] actually since 22:29 we are not overloading anymore [22:47:55] Interesting [22:47:56] or having timeout errors [22:48:05] but we do have errored scores [22:48:16] a low level of that is normal. [22:48:28] There's lots of messed up crap in Wikipedia and we can't score some of it. [22:48:31] meanwhile, any mac users here? you saw you need to set a root password, right [22:49:11] akosiaris, this is looking good! [22:49:21] yeah it's a little bit increased... about 2 per minute per https://grafana.wikimedia.org/dashboard/db/ores?panelId=2&fullscreen&orgId=1&from=now-30m&to=now-1m [22:49:29] but I guess it's ok [22:49:38] Maybe you mistyped something and now MW is hitting nothing ;) [22:49:40] yeah things are way better that 30 mins ago [22:49:43] lol [22:50:18] $wgOresBaseUrl = 'https://ores.discovery.wmnet/'; [22:50:20] looks correct [22:50:46] That’s great to see, I’m pretty sad thinking about all the net resources we were wasting. [22:52:07] cache hit rate is also steadily increasing [22:52:39] Want me to send some scoring requests? [22:52:41] and we are having timeouts again :-( [22:52:52] https://grafana.wikimedia.org/dashboard/db/ores?panelId=11&fullscreen&orgId=1&from=now-15m&to=now-1m [22:53:11] I bet that the MW beast just woke up again [22:53:59] (03PS2) 10Dzahn: remove ganglia from cache::text,cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/393676 (https://phabricator.wikimedia.org/T177225) [22:56:49] (03CR) 10Alexandros Kosiaris: [C: 031] ganglia: ensure more things are gone in decom class [puppet] - 10https://gerrit.wikimedia.org/r/393707 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [22:57:06] (03CR) 10Dzahn: [C: 032] remove ganglia from cache::text,cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/393676 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [22:57:09] 10Operations, 10Ops-Access-Requests: Requesting access to terbium/wasat for Trey Jones - https://phabricator.wikimedia.org/T181479#3794755 (10EBjune) To the powers that shepherd these controls, **I approve** giving Trey access to terbium/wasat. Thanks! [22:59:06] akosiaris, I think those timeouts are normal. 
Also an expected pattern [22:59:09] thanks [22:59:14] We shouldn't see *a ton* [22:59:23] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200* - https://phabricator.wikimedia.org/T181563#3794768 (10akosiaris) eqiad redis queue for ORES is no longer persisting in disk since 21:39 UTC today in an effort to address this.... [22:59:31] Somehow, I don't see MW hammering us since 22:30 [22:59:51] And honestly it was hammering us pretty consistently. [23:00:12] I do though on logs [23:00:18] [2017-11-28T23:00:06] [pid: 27340] 10.64.32.103 (-) {46 vars in 945 bytes} [Tue Nov 28 23:00:06 2017] GET /scores/yellowiki_1/damaging/?model_info=test_stats&format=json => generated 83 bytes in 1 msecs (HTTP/1.1 404) 6 headers in 214 bytes (1 switches on core 0) user agent "MediaWiki/1.31.0-alpha" [23:00:18] (03PS3) 10Dzahn: ganglia: ensure more things are gone in decom class [puppet] - 10https://gerrit.wikimedia.org/r/393707 (https://phabricator.wikimedia.org/T177225) [23:01:00] (03CR) 10Dzahn: [C: 032] ganglia: ensure more things are gone in decom class [puppet] - 10https://gerrit.wikimedia.org/r/393707 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [23:01:07] -alpha ? [23:01:16] what's -alpha ? wait this is weird [23:01:46] this is an external request [23:01:56] it's not us, it's somebody else [23:01:57] https://logstash.wikimedia.org/goto/8882066759e54dbf43ce8dfb28213dfb [23:02:12] ^ no more test_stat hammer [23:02:38] maybe I did mistype something ? weird though I triple checked it [23:02:44] oh damn [23:02:47] the port [23:03:05] I forgot the TCP port part. sorry [23:03:18] I 've managed to screw up the extension :-( [23:03:19] No worries. This was a good test! [23:03:39] Everything went back to normal when we stopped DOS'ing ourselves. [23:04:05] well fixing.. we should be back to DoSing ourselves pretty soon [23:05:47] (03PS1) 10Alexandros Kosiaris: ORES: Fix $wgOresBaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393930 (https://phabricator.wikimedia.org/T181538) [23:05:54] akosiaris, can you post timestamps here when you're done re-enabling: https://phabricator.wikimedia.org/T181567 [23:06:15] Maybe here is better: https://phabricator.wikimedia.org/T181538 [23:06:25] ok will do [23:06:35] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ORES: Fix $wgOresBaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393930 (https://phabricator.wikimedia.org/T181538) (owner: 10Alexandros Kosiaris) [23:07:07] (03CR) 10jenkins-bot: ORES: Fix $wgOresBaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393930 (https://phabricator.wikimedia.org/T181538) (owner: 10Alexandros Kosiaris) [23:07:12] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3794775 (10Halfak) OK. Dominant hypothesis is that we are DOS-ing ourselves via the ORES Ext. When we accidentally broke the $wgOresBaseUrl, the... 
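[Context: the "Fix $wgOresBaseUrl" change merged above is not quoted in the log, but from the conversation it amounts to appending the ORES service's TCP port to the base URL, since the bare discovery hostname left the extension unable to reach the service at all. A hypothetical sketch; the real port number never appears in this log, so the value below is purely a placeholder:]

    // wmf-config/CommonSettings.php -- hypothetical sketch of the follow-up fix.
    // PLACEHOLDER_PORT is not a real value; the actual port is not quoted in this log.
    $wgOresBaseUrl = 'https://ores.discovery.wmnet:PLACEHOLDER_PORT/';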
[23:07:44] !log akosiaris@tin Synchronized wmf-config/CommonSettings.php: T181538 (duration: 00m 49s) [23:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:51] T181538: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538 [23:08:31] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3794779 (10Halfak) BTW, see T181567 where we initially describe the correlation between failed "test_stats" request from MediaWiki and the downtime... [23:08:43] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3794781 (10Halfak) [23:09:43] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793518 (10Halfak) [23:09:44] oh yeah, precaching scores are showing up in grafana once more [23:09:45] 10Operations, 10Scoring-platform-team (Current), 10Wikimedia-Incident: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200* - https://phabricator.wikimedia.org/T181563#3794783 (10Halfak) 05Open>03Resolved a:03Halfak This looks done to me. Feel free to re-open [23:09:50] and tons of logs [23:10:06] GET /scores/etwiki/damaging/?model_info=test_stats&format=json [23:10:20] all from mediawiki, all of that format ^ [23:10:36] I am guessing we should start seeing issues soon [23:13:16] Shows the whole period so far: https://logstash.wikimedia.org/goto/91568910c65b23afcbfab5f15120c7e1 [23:13:25] and the overloads started [23:13:25] Complete with broken URL period [23:13:33] SCIENCE [23:13:35] I am seeing about 100/min right now [23:13:43] We just did an experiment and now we have causation! [23:13:46] \o/ [23:14:07] lol [23:15:35] (03PS1) 10BBlack: bast4002: add proper roles [puppet] - 10https://gerrit.wikimedia.org/r/393933 [23:15:37] (03PS1) 10BBlack: bast4002: add to network::constants for ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/393934 [23:15:39] (03PS1) 10BBlack: bast4002: use as canary and smokeping targets [puppet] - 10https://gerrit.wikimedia.org/r/393935 [23:15:41] (03PS1) 10BBlack: bast4002: use as ulsfo tftp [puppet] - 10https://gerrit.wikimedia.org/r/393936 [23:16:29] (03CR) 10jerkins-bot: [V: 04-1] bast4002: add proper roles [puppet] - 10https://gerrit.wikimedia.org/r/393933 (owner: 10BBlack) [23:16:37] jerkins ? [23:16:59] I just noticed that [23:17:00] it's jerkins when it adds V -1 [23:17:01] :P [23:17:10] since when ? [23:17:11] * Reedy pats akosiaris on the head [23:17:15] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3794850 (10Halfak) It looks like we have causal evidence that MW is hammering ORES into the ground. Now that $wgOresBaseUrl is fixed, ORES is aga... [23:17:15] Ages [23:17:18] legoktm: ^ [23:17:20] for the love of ... [23:17:28] uh [23:17:31] I 've been around for 5 years and never seen it ? 
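[Context: the "test_stats" traffic quoted above is the request pattern at the heart of the incident -- one model_info=test_stats lookup per wiki against the damaging model. A self-contained, illustrative sketch of that request shape (this is not the ORES extension's actual code; the base URL and wiki name are simply values quoted earlier in this log):]

    <?php
    // Illustrative sketch of the per-wiki request shape seen in the ORES access logs above.
    $base = 'https://ores.discovery.wmnet/';
    $wiki = 'etwiki';
    $url  = $base . "scores/{$wiki}/damaging/?model_info=test_stats&format=json";
    $resp = file_get_contents( $url );   // only reachable from inside the production network
    if ( $resp !== false ) {
        var_dump( json_decode( $resp, true ) );
    }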
[23:17:34] <3 [23:17:38] by the way, the roles for bastions, i really wanted to make that a single one at https://gerrit.wikimedia.org/r/#/c/353599/ [23:17:41] i just cant compile it [23:17:47] akosiaris: don't worry, it's only been around for months now [23:17:59] because i get (unrelated) https://phabricator.wikimedia.org/T180671 [23:18:08] ah, phew at least it's not 5 years [23:18:11] hahah, no [23:18:24] legoktm: you are not saying that to make me feel better, right ? [23:18:30] I should look at the code [23:18:53] I guess I can ignore jerkins [23:18:56] ;) [23:19:05] (03PS1) 10Dzahn: druid/statistics: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/393938 (https://phabricator.wikimedia.org/T177225) [23:19:13] 23:16:25 wmf-style: total violations delta 1 [23:19:13] 23:16:25 NEW violations: [23:19:13] https://github.com/wikimedia/labs-tools-wikibugs2/blame/7f85b1857fbd887b53523ab4a312673b31db0ba4/grrrrit.py#L123 [23:19:14] 23:16:25 manifests/site.pp:123 wmf-style: node 'bast4002.wikimedia.org' includes multiple roles [23:19:21] bblack: yes, ignore it and say that https://gerrit.wikimedia.org/r/#/c/353599/ should fix it later [23:19:25] akosiaris: 11 months [23:19:38] Did grrrit-wm do it before that? [23:19:52] well, or we can do that change, i just wish i could compile it on all bastions, not just 2 [23:20:11] yeah [23:20:17] 11 months.... [23:20:24] hehe [23:20:30] I need to unhardwire my brain [23:20:42] just get more downvotes :) [23:20:52] I 've been downvoted by jerkins quite a few times [23:20:57] never saw it [23:21:31] I 'll schedule an appointment with my ophthalmologist [23:22:59] (03CR) 10BBlack: [V: 032 C: 032] bast4002: add proper roles [puppet] - 10https://gerrit.wikimedia.org/r/393933 (owner: 10BBlack) [23:23:02] (03CR) 10BBlack: [C: 032] bast4002: add to network::constants for ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/393934 (owner: 10BBlack) [23:23:51] (03PS2) 10Dzahn: druid/statistics: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/393938 (https://phabricator.wikimedia.org/T177225) [23:26:45] Reedy: grrrit-wm was supposed to, but the feature was broken. so I fixed it when re-implementing it [23:28:15] (03CR) 10Dzahn: [C: 032] druid/statistics: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/393938 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [23:30:31] ah, right, the new bastions means new ferm rule on everything [23:30:40] and therefore restart of ferm on everything [23:31:16] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3794913 (10akosiaris) Adding OOM kernel logs per host for posterity's sake. Feel free to ignore `electron` in the logs. It's already being memory... [23:33:11] restart of ferm on everything? [23:33:26] well, editing of rules on everything, but it will happen on its own [23:33:47] bblack: well, "refresh of service" [23:33:58] to add the new rule to allow connectiosn from the new bastion [23:34:07] yea [23:35:31] how close are we to ganglia decom? 
[23:35:45] (03PS2) 10BBlack: bast4002: use as canary and smokeping targets [puppet] - 10https://gerrit.wikimedia.org/r/393935 [23:35:47] (03PS2) 10BBlack: bast4002: use as ulsfo tftp [puppet] - 10https://gerrit.wikimedia.org/r/393936 [23:35:49] (03PS1) 10BBlack: bast4002: switch over prometheus+ganglia [puppet] - 10https://gerrit.wikimedia.org/r/393943 [23:36:11] i literally just wiped it from cache servers in the last hour [23:36:25] there are some left, stuff like exim and maps [23:36:43] \o/ [23:36:45] ok [23:36:46] bblack: if you are wondering about addding ganglia aggregator on bastion, no, not needed [23:36:46] nice! [23:36:55] I guess purging package and such too right? [23:37:28] ok, yeah, I'll amend my commits and just leave out moving ganglia over for 4002 I guess, and manually remove the ganglia packages/config? [23:37:55] yea, i just did that . this: https://gerrit.wikimedia.org/r/#/c/393676/ applies this decom class https://gerrit.wikimedia.org/r/#/c/393707/3/modules/ganglia/manifests/monitor/decommission.pp [23:38:55] bblack: the best is to just add "standard::has_ganglia: false" in hiera to it [23:39:04] it will remove it for you [23:39:16] it's ok I can get it [23:40:20] (03PS3) 10BBlack: bast4002: use as canary and smokeping targets [puppet] - 10https://gerrit.wikimedia.org/r/393935 [23:40:22] (03PS3) 10BBlack: bast4002: use as ulsfo tftp [puppet] - 10https://gerrit.wikimedia.org/r/393936 [23:40:24] (03PS2) 10BBlack: bast4002: switch over prometheus [puppet] - 10https://gerrit.wikimedia.org/r/393943 [23:40:26] (03PS1) 10BBlack: bast4002: remove ganglia aggregator [puppet] - 10https://gerrit.wikimedia.org/r/393944 [23:40:38] ok, later i will purge it from all bastions by role. it will just be one of the last steps to kill the aggreagtors there.. until we have replacement for psql, exim [23:40:44] metrics [23:43:03] (03CR) 10BBlack: [C: 032] bast4002: remove ganglia aggregator [puppet] - 10https://gerrit.wikimedia.org/r/393944 (owner: 10BBlack) [23:43:07] (03CR) 10BBlack: [C: 032] bast4002: use as canary and smokeping targets [puppet] - 10https://gerrit.wikimedia.org/r/393935 (owner: 10BBlack) [23:43:12] (03CR) 10BBlack: [C: 032] bast4002: use as ulsfo tftp [puppet] - 10https://gerrit.wikimedia.org/r/393936 (owner: 10BBlack) [23:54:58] /win 13 [23:58:16] !log Ran scap pull on mwdebug1001 after T181385 related testing [23:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:24] T181385: Wikidata truthy nt dumpers stuck with 100% CPU on snapshot1007 - https://phabricator.wikimedia.org/T181385 [23:59:12] mutante: I killed all the ganglia stuff manually, but puppet added it back (not as an aggregator, but as a client, so I guess something in the bast/installserver roles still has config for ganglia client-side stuff)