[00:00:07] (PS5) Alex Monk: Remove maximum version for acme library [software/certcentral] - https://gerrit.wikimedia.org/r/459866
[00:02:21] (CR) jerkins-bot: [V: -1] Remove maximum version for acme library [software/certcentral] - https://gerrit.wikimedia.org/r/459866 (owner: Alex Monk)
[00:07:30] (PS6) Alex Monk: Remove maximum version for acme library [software/certcentral] - https://gerrit.wikimedia.org/r/459866
[00:07:50] (CR) Alex Monk: "Looks like it's this line in the commit linked above changing things to None:" [software/certcentral] - https://gerrit.wikimedia.org/r/459866 (owner: Alex Monk)
[00:09:12] (CR) jerkins-bot: [V: -1] Remove maximum version for acme library [software/certcentral] - https://gerrit.wikimedia.org/r/459866 (owner: Alex Monk)
[00:12:46] (PS3) Dzahn: icinga: have a default notes_url for all services [puppet] - https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873)
[00:13:22] (CR) Dzahn: "amended to use sections on a single wiki page.
before merging this i'll bring it up in monitoring meeting and/or a list mail" [puppet] - https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[00:14:04] (CR) jerkins-bot: [V: -1] icinga: have a default notes_url for all services [puppet] - https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[00:18:40] (PS1) Dduvall: ci: Give Docker more space on large-disk instances [puppet] - https://gerrit.wikimedia.org/r/459875 (https://phabricator.wikimedia.org/T203841)
[00:19:11] (CR) jerkins-bot: [V: -1] ci: Give Docker more space on large-disk instances [puppet] - https://gerrit.wikimedia.org/r/459875 (https://phabricator.wikimedia.org/T203841) (owner: Dduvall)
[00:20:23] (PS1) Dzahn: tor::relay: make Tor family configurable and move to Hiera [puppet] - https://gerrit.wikimedia.org/r/459876
[00:20:29] (PS7) Alex Monk: Remove maximum version for acme library [software/certcentral] - https://gerrit.wikimedia.org/r/459866
[00:21:49] (CR) Dzahn: "re: setting family -> https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/459876/" [puppet] - https://gerrit.wikimedia.org/r/399972 (owner: Faidon Liambotis)
[00:24:24] (CR) Alex Monk: "So I think that's because in ACMEAccount.create we were starting it off as a dict instead of a acme.messages.Registration object, which ul" [software/certcentral] - https://gerrit.wikimedia.org/r/459866 (owner: Alex Monk)
[00:25:23] (PS8) Alex Monk: Remove maximum version for acme library [software/certcentral] - https://gerrit.wikimedia.org/r/459866
[00:25:31] (PS1) Dzahn: remove hosts/radium.yaml from Hiera [puppet] - https://gerrit.wikimedia.org/r/459878 (https://phabricator.wikimedia.org/T203861)
[00:27:05] (Abandoned) Dduvall: Parameterize profile::labs::lvm::srv volume size [puppet] - https://gerrit.wikimedia.org/r/459850 (https://phabricator.wikimedia.org/T203842) (owner:
Dduvall)
[00:27:47] (CR) Alex Monk: "actually I think that's the fault of a commit further up the chain" [software/certcentral] - https://gerrit.wikimedia.org/r/459841 (owner: Alex Monk)
[00:29:46] (PS2) Dduvall: ci: Give Docker more space on large-disk instances [puppet] - https://gerrit.wikimedia.org/r/459875 (https://phabricator.wikimedia.org/T203841)
[00:30:20] (CR) jerkins-bot: [V: -1] ci: Give Docker more space on large-disk instances [puppet] - https://gerrit.wikimedia.org/r/459875 (https://phabricator.wikimedia.org/T203841) (owner: Dduvall)
[00:30:27] (PS4) Alex Monk: api: Also handle SIGHUP signals to the API process [software/certcentral] - https://gerrit.wikimedia.org/r/459785
[00:30:39] (PS4) Alex Monk: Be a lot more verbose about problems in the ACME process [software/certcentral] - https://gerrit.wikimedia.org/r/459798
[00:30:45] (PS3) Alex Monk: Log command we run for DNS zone updates [software/certcentral] - https://gerrit.wikimedia.org/r/459799
[00:30:50] (PS2) Alex Monk: setup.py test dependencies: Remove pylint maximum version [software/certcentral] - https://gerrit.wikimedia.org/r/459811
[00:30:54] (PS2) Alex Monk: Compatibility with new flask version [software/certcentral] - https://gerrit.wikimedia.org/r/459841
[00:30:59] (PS9) Alex Monk: Remove maximum version for acme library [software/certcentral] - https://gerrit.wikimedia.org/r/459866
[00:38:33] (PS3) Dduvall: ci: Give Docker more space on large-disk instances [puppet] - https://gerrit.wikimedia.org/r/459875 (https://phabricator.wikimedia.org/T203841)
[00:42:32] (PS4) Dduvall: ci: Give Docker more space on large-disk instances [puppet] - https://gerrit.wikimedia.org/r/459875 (https://phabricator.wikimedia.org/T203841)
[00:47:07] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) -
https://phabricator.wikimedia.org/T191921 (Legoktm) It's unclear to me why we're continuing to invest so much time in getting HHVM to work when we're going t...
[00:48:37] PROBLEM - Disk space on elastic1022 is CRITICAL: DISK CRITICAL - free space: /srv 50529 MB (10% inode=99%)
[00:54:07] RECOVERY - Disk space on elastic1022 is OK: DISK OK
[00:59:12] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (Legoktm) ``` legoktm@deploy1001:~$ time PHP=php7.0 mwscript rebuildLocalisationCache.php --wiki=enwiki --outdir=/t...
[01:11:35] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (mmodell) @legoktm: There was some concern about incompatibilities between the mbstring in php7 vs hhvm and an asse...
[01:12:07] (PS1) Legoktm: mediawiki: Clean up php7 package list [puppet] - https://gerrit.wikimedia.org/r/459881
[01:13:26] (PS1) Legoktm: mediawiki: Install php-dba in PHP 7 [puppet] - https://gerrit.wikimedia.org/r/459882
[01:15:07] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS1257/IPv6: Connect, AS1257/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:24:57] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 10, down: 0, shutdown: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:27:57] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:30:07] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:32:37] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS1257/IPv6: Connect, AS1257/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:46:57] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:03:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:05:47] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:28:09] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.20) (duration: 08m 30s)
[02:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:35] (CR) Dzahn: "i would still move the family string to Hiera as in the change above, but isn't it going to be the same for all 3 instances so technically" [puppet] - https://gerrit.wikimedia.org/r/399972 (owner: Faidon Liambotis)
[02:38:53] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Wed Sep 12 02:38:53 UTC 2018 (duration 10m 44s)
[02:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:41:39] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (Krinkle) I share the same concern as what @mmodell remembers. Having said that, I believe at this point in time th...
[02:44:16] (CR) Dzahn: [C: 2] remove hosts/radium.yaml from Hiera [puppet] - https://gerrit.wikimedia.org/r/459878 (https://phabricator.wikimedia.org/T203861) (owner: Dzahn)
[02:47:24] Operations, Domains, Traffic, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (Dzahn)
[03:00:11] Operations, Domains, Traffic, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (Dzahn) Hi @tramm i took your request. I see you want to transfer the domain entirely to Wikimedia Eesti. I'm contacting legal because they handle the domain re...
[03:27:17] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 931.08 seconds
[03:56:47] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 299.01 seconds
[04:13:16] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (Legoktm) I ran a script to diff the CDBs I just generated with PHP 7.0, and there's no functional diff (just some...
[04:43:04] (CR) Krinkle: "Thank you!"
[puppet] - https://gerrit.wikimedia.org/r/447654 (https://phabricator.wikimedia.org/T174047) (owner: Bstorm)
[04:56:40] (CR) KartikMistry: "recheck" [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/450900 (https://phabricator.wikimedia.org/T199962) (owner: KartikMistry)
[05:01:39] Operations, ops-eqiad, Discovery, Wikidata, and 2 others: add SSDs to wdqs100[45] - https://phabricator.wikimedia.org/T202779 (Smalyshev) Open>Resolved
[05:01:49] Operations, ops-codfw, Discovery, Wikidata, and 2 others: add SSDs to wdqs200[12] - https://phabricator.wikimedia.org/T202777 (Smalyshev) Open>Resolved a:Smalyshev
[05:02:25] Operations, ops-codfw, Discovery, Wikidata, and 2 others: add ssds to wdqs2003 - https://phabricator.wikimedia.org/T202778 (Smalyshev) Open>Resolved
[05:03:59] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (Smalyshev) Is something happening on this or this was shelved for now?
[05:04:56] PROBLEM - Disk space on elastic1028 is CRITICAL: DISK CRITICAL - free space: /srv 51152 MB (10% inode=99%)
[05:05:57] RECOVERY - Disk space on elastic1028 is OK: DISK OK
[05:10:46] (CR) jerkins-bot: [V: -1] hfst: Sync package from Debian [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/450900 (https://phabricator.wikimedia.org/T199962) (owner: KartikMistry)
[05:39:36] (PS1) Marostegui: analytics-grants.sql.erb: Add new grants [puppet] - https://gerrit.wikimedia.org/r/459896 (https://phabricator.wikimedia.org/T200801)
[05:41:52] (CR) Marostegui: [C: 2] analytics-grants.sql.erb: Add new grants [puppet] - https://gerrit.wikimedia.org/r/459896 (https://phabricator.wikimedia.org/T200801) (owner: Marostegui)
[05:46:23] Operations, DBA, Patch-For-Review: Puppetize grants for mysql analytics servers - https://phabricator.wikimedia.org/T114476 (Marostegui) Open>Resolved I have renamed the file from `research-grants.sql.erb` to `analytics-grants.sql.erb` so we can have all the users that are actually active (T2...
[06:10:38] legoktm: the ticket you poked me toward yesterday had a dupe, ( https://phabricator.wikimedia.org/T97368 ), should it be high again?
[06:10:48] addshore: yes
[06:11:17] oh wait, it did get set to high...
[06:11:20] addshore: to quote from the previous task, "This will cause an outage soon, needs to be fixed"
[06:11:46] yup
[06:13:33] Operations, DBA, MediaWiki-extensions-Translate, Datacenter-Switchover-2018, Wikimedia-production-error: DBPerformance warning "Query returned 22186 rows: query: SELECT * FROM `translate_metadata`" on Meta-Wiki - https://phabricator.wikimedia.org/T204026 (jcrespo)
[06:15:23] legoktm: guess I need to find out how it used use the most / which method in CachingPropertyInfoLookup is called the most....
[06:16:27] addshore: yeah, and then figure out how to shard the cache or store it in something else like on disk or apc maybe?
[06:16:38] yup
[06:17:00] addshore: also I am skeptical that the cache key needs to be split by $wgVersion, I remember that being needed very early on when the serialization of objects kept changing but I don't think that's still the case
[06:19:20] we could definitely just do it with some var passed in and bump it when the structure changes
[06:20:06] const VERSION = 1
[06:20:15] like everywhere else in MediaWiki :)
[06:21:24] yup :)
[06:22:08] Operations, MediaWiki-extensions-WikibaseClient, MediaWiki-extensions-WikibaseRepository, Performance-Team, and 3 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (Addshore) >>! In T97368#1241094, @daniel wrote: > One...
[06:23:33] legoktm: the dc centre switch over is happening but swat is still happening right?
[06:24:01] according to https://wikitech.wikimedia.org/wiki/Deployments#Wednesday,_September_12 yes
[06:25:38] (CR) Elukey: [C: 1] icinga::performance: remind users to ignore checks using notes_url [puppet] - https://gerrit.wikimedia.org/r/459864 (https://phabricator.wikimedia.org/T203485) (owner: Dzahn)
[06:30:42] (Abandoned) Muehlenhoff: Test 4.14 netboot image for backup2001 [puppet] - https://gerrit.wikimedia.org/r/457930 (owner: Muehlenhoff)
[06:34:53] Operations, DBA, Research, Services (designing): Storage of data for recommendation API - https://phabricator.wikimedia.org/T203039 (jcrespo) > If we had some other MySQL cluster that would be the best option 2.2GB of data is a ridiculous small amount of data, and it would fit comfortably in one...
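Note on the 06:17 to 06:21 cache-key exchange above: the suggestion is to stop splitting the cache key by $wgVersion and instead embed a hand-maintained format version that is bumped only when the cached structure changes. A minimal Python sketch of that pattern follows. MediaWiki itself is PHP, and the names `CACHE_FORMAT_VERSION` and `make_cache_key`, as well as the key layout, are hypothetical illustrations rather than anything taken from Wikibase.

```python
# Hypothetical illustration of a hand-versioned cache key.
# The names and the key layout are assumptions, not Wikibase code.

CACHE_FORMAT_VERSION = 1  # bump by hand when the cached structure changes


def make_cache_key(prefix: str, name: str, version: int = CACHE_FORMAT_VERSION) -> str:
    """Build a cache key that changes only when the format version changes."""
    return f"{prefix}:{name}:v{version}"


if __name__ == "__main__":
    # A software release that leaves the cached structure untouched keeps
    # the same key, so the existing cache entry stays valid.
    print(make_cache_key("wikibase", "property-info"))
```

With this approach a deploy no longer invalidates the cache by itself; only a deliberate version bump does.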
[06:35:47] !log installing confuse security updates
[06:35:51] (PS1) Addshore: Debug logging for T97368 [mediawiki-config] - https://gerrit.wikimedia.org/r/459906 (https://phabricator.wikimedia.org/T97368)
[06:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:34] legoktm: ^^ gonna just do it in a debug log, then we should be able to know what we want to do and pick the ticket up tomorrow :)
[06:37:20] addshore: you know there's also 'AdHocDebug' which you can just use whenever?
[06:37:33] *looks*
[06:37:52] as a log channel
[06:37:53] * addshore looks to see if anyone is using it right now
[06:38:05] anyways, whatever works for you :)
[06:38:06] (PS1) Muehlenhoff: Add library hint for libconfuse [puppet] - https://gerrit.wikimedia.org/r/459907
[06:38:17] ooh, I'll use that
[06:39:21] (Abandoned) Addshore: Debug logging for T97368 [mediawiki-config] - https://gerrit.wikimedia.org/r/459906 (https://phabricator.wikimedia.org/T97368) (owner: Addshore)
[06:40:09] legoktm: do I be evil and backport it now or wait for swat...
[06:40:36] uh well I'm going to sleep now, so if you're evil, I won't be involved ;)
[06:40:42] hahahaha
[06:40:47] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is CRITICAL: 53.46 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[06:42:27] (CR) Muehlenhoff: [C: 2] Add library hint for libconfuse [puppet] - https://gerrit.wikimedia.org/r/459907 (owner: Muehlenhoff)
[06:44:07] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is OK: (C)60 le (W)70 le 74.62 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[06:56:55] !log mobrovac@deploy1001 Started deploy [proton/deploy@ecb9a0e]: Update to Puppeteer v1.7.0 and fix browser connection abort handling - T181623
[06:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:02] T181623: Chromium-render doesn't handle browser connection abort well - https://phabricator.wikimedia.org/T181623
[06:57:14] Operations, Performance-Team, Traffic, Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (Laurentius) >>! In T199252#4575757, @kaldari wrote: > Do we know how many pa...
[06:57:19] * addshore is going to squeeze https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/459905/ out the door (adding some debug logging) on wmf.20
[06:57:53] !log mobrovac@deploy1001 Finished deploy [proton/deploy@ecb9a0e]: Update to Puppeteer v1.7.0 and fix browser connection abort handling - T181623 (duration: 00m 58s)
[06:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:23] * addshore waits for jenkins
[07:13:37] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is CRITICAL: 51.18 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:21:27] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is OK: (C)60 le (W)70 le 71.93 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:26:05] * addshore continues waiting for jenkins
[07:33:04] finally merged...
[07:38:06] !log addshore@deploy1001 Synchronized php-1.32.0-wmf.20/extensions/Wikibase/lib/includes/Store/: Debug logging for T97368 [[gerrit:459905]] (duration: 00m 53s)
[07:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:15] T97368: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368
[07:40:02] Operations, MediaWiki-extensions-WikibaseClient, MediaWiki-extensions-WikibaseRepository, Performance-Team, and 4 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (hashar)
[07:42:20] Operations, Puppet, Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q1): exported puppet resources are not queryable: cannot create grafana graphs of EventLogging running in beta cluster - https://phabricator.wikimedia.org/T204088 (fgiunchedi) Indeed that's what's going on due to lack of exp...
[07:42:27] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.9467 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[07:42:56] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.934 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[07:43:47] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.9309 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[07:43:47] that must be me
[07:43:50] * addshore will revert it
[07:44:55] syncing ...
[07:45:12] (PS1) Jcrespo: mariadb: Depool db1098 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/459974
[07:45:39] !log addshore@deploy1001 Synchronized php-1.32.0-wmf.20/extensions/Wikibase/lib/includes/Store/: REVERT: Debug logging for T97368 [[gerrit:459905]] (duration: 00m 51s)
[07:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:47] T97368: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368
[07:46:57] addshore: looks like it yeah, perhaps sampled logging?
[07:47:03] as an alternative that is
[07:47:10] yup, or also not sending it to logstash
[07:47:38] though I guess that might still hit the same issue
[07:48:07] well, I'll have a look at the 7 mins of data I have anyway
[07:48:28] likely, udp2log is generally more performant in receiving than logstash but would be flooded indeed
[07:49:12] I checked the rate and it looks like it was the same rate as api.log, but that has 'api' => [ 'logstash' => false ],
[07:49:16] which I overlooked
[07:49:33] (CR) Jcrespo: [C: 2] mariadb: Depool db1098 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/459974 (owner: Jcrespo)
[07:50:16] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[07:50:36] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[07:50:52] (Merged) jenkins-bot: mariadb: Depool db1098 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/459974 (owner: Jcrespo)
[07:51:36] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[07:53:54] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1098 (duration: 00m 50s)
[07:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:36] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is CRITICAL: 59.84 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:56:34] (CR) jenkins-bot: mariadb: Depool db1098 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/459974 (owner: Jcrespo)
[07:59:00] godog: I might do some sampled grafana tracking, but I'm going to leave this until later this week / next week now
[07:59:18] addshore: ack, thanks for the quick reaction!
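Note on the sampled-logging suggestion above: one way to keep a hot code path from flooding a log pipeline is to emit only a random fraction of its records. The sketch below uses Python's standard `logging.Filter`. The class name `SamplingFilter`, the `sample_rate` parameter and the logger name are hypothetical, and this is not how MediaWiki or the logstash setup discussed here implements sampling.

```python
import logging
import random


class SamplingFilter(logging.Filter):
    """Let roughly one record in `sample_rate` through and drop the rest.

    Hypothetical helper for illustration only.
    """

    def __init__(self, sample_rate, rng=None):
        super().__init__()
        if sample_rate < 1:
            raise ValueError("sample_rate must be >= 1")
        self.sample_rate = sample_rate
        # An injectable random source keeps the filter testable.
        self._rng = rng if rng is not None else random.Random()

    def filter(self, record):
        # randrange(n) is uniform over 0..n-1, so this is true about 1/n of the time.
        return self._rng.randrange(self.sample_rate) == 0


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    log = logging.getLogger("property-info-debug")  # assumed logger name
    log.addFilter(SamplingFilter(sample_rate=100))
    for i in range(1000):
        log.debug("cache lookup %d", i)  # about 1% of these are emitted
```

Sampling trades completeness for volume: rates and call-site distributions remain estimable, but individual events can be missed.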
[07:59:38] Operations, Analytics, hardware-requests: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (elukey) >>! In T203087#4574561, @Krenair wrote: > No switch port disabling step for VMs either Yep this makes sense since the underlying ganeti ho...
[08:00:03] Operations, Analytics, hardware-requests: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (elukey)
[08:00:28] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy
[08:00:55] !log mobrovac@deploy1001 Started restart [proton/deploy@ecb9a0e]: (no justification provided)
[08:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:01] Operations, Elasticsearch, Discovery-Search (Current work), Patch-For-Review: Alert when elasticsearch has shards larger than a maximum size - https://phabricator.wikimedia.org/T203546 (Mathew.onipe) Output of testing the shard size check script on relforge: onimisionipe@relforge1001:~/tests$ py...
[08:03:57] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy
[08:05:27] !log stopping db1098 (both db instances) for maintenance
[08:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:36] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is CRITICAL: 58.93 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:05:59] Operations, Performance-Team, monitoring, Patch-For-Review: Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (fgiunchedi) Adding #monitoring for visibility/discussion. This is unfortunately one of the cases where icinga "multi tenancy" model breaks down, na...
[08:08:57] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is OK: (C)60 le (W)70 le 81.39 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:14:10] Hey there Ops! Happy server switch day! You got this :)
[08:14:57] (PS1) Jcrespo: Revert "mariadb: Depool db1098 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/459975
[08:17:45] Operations, MediaWiki-extensions-WikibaseClient, MediaWiki-extensions-WikibaseRepository, Performance-Team, and 5 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (Addshore) So the logging I added only ran for 7 minut...
[08:22:19] (PS2) Volans: cookbook: split main() into parse_args() and run() [software/spicerack] - https://gerrit.wikimedia.org/r/458115 (https://phabricator.wikimedia.org/T199079)
[08:24:10] (CR) Volans: "Chatting with various people it was not clear if the setup() step was a good move, as it allows for too generic stuff, introduce polymorph" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/458115 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[08:24:22] Operations, Scap, Patch-For-Review, Release-Engineering-Team (Kanban): Scap should use Eval.Jit=1 when calling rebuildLocalisationCache.php via HHVM - https://phabricator.wikimedia.org/T203680 (hashar) The number of threads is irrelevant as shown on T191921#4248767: > * 1 thread (32 cores): 1...
[08:24:27] (CR) Gehel: Elasticsearch shard size check (2 comments) [puppet] - https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: Mathew.onipe)
[08:34:38] (PS1) Jcrespo: mariadb: Depool db1121 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/459976
[08:36:10] !log repair sdc on ms-be2041 - T199198
[08:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:17] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198
[08:44:26] Operations, Maps-Sprint, Maps (Tilerator), Reading-Infrastructure-Team-Backlog (Kanban): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (Gehel) Oh, I completely forgot about `populate_admin()`! This might have been generating lock contention. Not exactly sur...
[08:45:05] (CR) Volans: [C: -1] "Much nicer! Almost there. One small bug and few totally optional comments in addition of the last ones from Guillaume."
(4 comments) [puppet] - https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: Mathew.onipe)
[08:47:16] (CR) Gehel: [C: -1] Elasticsearch shard size check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: Mathew.onipe)
[08:51:42] (CR) Filippo Giunchedi: [C: 1] mtail: restart on change to exim program [puppet] - https://gerrit.wikimedia.org/r/459789 (owner: Herron)
[08:55:16] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool db1098 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/459975 (owner: Jcrespo)
[08:56:37] (Merged) jenkins-bot: Revert "mariadb: Depool db1098 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/459975 (owner: Jcrespo)
[08:58:03] Operations, Maps-Sprint, Maps (Tilerator): Log slow queries on - https://phabricator.wikimedia.org/T204106 (Gehel) p:Triage>High
[08:58:52] (PS2) Jcrespo: mariadb: Depool db1121 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/459976
[08:59:40] jynus: ^ will you update db1125 (sanitarium slave that hangs from db1121) and labs?
[08:59:50] like upgrade mysql there too
[09:00:35] yes, any issue?
[09:00:46] nope, just wondering :-)
[09:01:04] I only plan to do the production one
[09:01:32] I want to check 10.1.36 with real load before mass upgrading to it
[09:01:37] so db1125 and labsd will remain with 1.35?
[09:01:42] for now
[09:01:58] cool, that is also a good test to have a "master" with 1.36 and slaves with 1.35
[09:01:59] upgrading labs takes a bit more time, a whole day
[09:02:17] Operations, Maps-Sprint, Maps (Tilerator): Log slow queries on - https://phabricator.wikimedia.org/T204106 (Gehel) @Mathew.onipe : if you start looking into this task, a few pointers: * the puppet postgresql module is https://github.com/wikimedia/puppet/tree/production/modules/postgresql * we want t...
[09:02:44] onimisionipe: ^ we can have a look into this at some point
[09:02:46] I know 36 is larger than 35, but I am confident no issues will arise
[09:03:02] yeah, that is what I was saying ,that it will be a good test
[09:03:13] In case we want to also upgrade eqiad masters while they are inactive
[09:03:21] and then slowly eqiad replicas
[09:03:49] (CR) Jcrespo: [C: 2] mariadb: Depool db1121 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/459976 (owner: Jcrespo)
[09:04:55] I chose it because I liked it as you said
[09:05:02] (Merged) jenkins-bot: mariadb: Depool db1121 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/459976 (owner: Jcrespo)
[09:05:14] and also because it was on 10.1.32 and needed a reboot and upgrade
[09:05:21] yeah :-)
[09:05:38] * marostegui excited to get all the masters upgraded to 10.1 soon! :)
[09:07:01] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1098 (s6 and s7) and depool db1121 (duration: 00m 49s)
[09:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:10] (PS1) Jcrespo: Revert "mariadb: Depool db1121 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/459979
[09:15:57] RECOVERY - Filesystem available is greater than filesystem size on ms-be2041 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops
[09:16:56] PROBLEM - Host backup2001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:17:39] ^ that's me, silencing
[09:22:06] (CR) jenkins-bot: Revert "mariadb: Depool db1098 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/459975 (owner: Jcrespo)
[09:22:08] (CR) jenkins-bot: mariadb: Depool db1121 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/459976 (owner: Jcrespo)
[09:25:28] jynus: watch out, beta scap fails with \nFatal error: syntax error, unexpected $end, expecting ']' in /srv/mediawiki-staging/wmf-config/InitialiseSettings.php on line 13786\n")
[09:25:33] not sure whether it is related to your patch
[09:28:13] I didn't touch InitialiseSettings.php
[09:28:29] yeah works fine now
[09:28:35] ?
[09:28:36] I have no idea what might have happened
[09:29:10] scap ran with no error on production
[09:29:13] twice
[09:29:55] it just a fast alarm from beta cluster
[09:30:12] I just shouted here since I noticed some db related changes :]
[09:30:34] must have been a cosmic ray
[09:38:57] PROBLEM - MariaDB Slave Lag: s2 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.15 seconds
[09:42:28] Operations, MediaWiki-extensions-WikibaseClient, MediaWiki-extensions-WikibaseRepository, Performance-Team, and 5 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (daniel) Another option would be to force this to go i...
[09:45:16] Operations: Add favicon to icinga an tendril - https://phabricator.wikimedia.org/T204110 (jcrespo)
[09:45:26] Operations: Add favicon to icinga an tendril - https://phabricator.wikimedia.org/T204110 (jcrespo) p:Triage>Lowest
[09:46:32] !log Drop users: connect, status and fabmigrate from dbstore1002 - T200801
[09:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:44] Operations, MediaWiki-extensions-WikibaseClient, MediaWiki-extensions-WikibaseRepository, Performance-Team, and 5 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (Addshore) >>! In T97368#4576510, @daniel wrote: > Ano...
[09:52:51] Operations: Add favicon to icinga an tendril - https://phabricator.wikimedia.org/T204110 (jcrespo) Suggestions (didn't check license): * https://github.com/Icinga/icinga-web/blob/master/pub/images/icinga/favicon.ico * https://commons.wikimedia.org/wiki/File:Database_error.png
[09:56:47] Operations, DBA, JADE, TechCom-RFC, Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (daniel) >>! In T200297#4575766, @awight wrote: > I'm making some changes to the proposal, which I hope em...
[09:57:13] !log restart db1121 for upgrade [09:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:58] (03PS1) 10Ema: site: make cp1099 the new pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/459989 (https://phabricator.wikimedia.org/T202966) [10:05:47] (03PS4) 10Elukey: Remove meitnerium (old archiva host) from puppet [puppet] - 10https://gerrit.wikimedia.org/r/458519 (https://phabricator.wikimedia.org/T203087) [10:10:07] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1121 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459979 (owner: 10Jcrespo) [10:11:28] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1121 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459979 (owner: 10Jcrespo) [10:12:38] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1121 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459979 (owner: 10Jcrespo) [10:25:18] (03PS1) 10Jcrespo: mariadb: Depool db1110 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459992 [10:29:08] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1110 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459992 (owner: 10Jcrespo) [10:30:21] (03Merged) 10jenkins-bot: mariadb: Depool db1110 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459992 (owner: 10Jcrespo) [10:32:47] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1121, depool db1110 (duration: 00m 50s) [10:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:51] (03PS1) 10Elukey: mariadb::service: remove old require causing issues on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/459994 (https://phabricator.wikimedia.org/T204074) [10:40:25] (03CR) 10jerkins-bot: [V: 04-1] mariadb::service: remove old require causing issues on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/459994 (https://phabricator.wikimedia.org/T204074) (owner: 
10Elukey) [10:40:35] (03CR) 10Elukey: [C: 04-1] "Jaime/Manuel: I created the code change so I can easily test this in labs and report back to the task, no intention to merge now." [puppet] - 10https://gerrit.wikimedia.org/r/459994 (https://phabricator.wikimedia.org/T204074) (owner: 10Elukey) [10:41:34] (03PS2) 10Elukey: mariadb::service: remove old require causing issues on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/459994 (https://phabricator.wikimedia.org/T204074) [10:41:57] (03CR) 10Elukey: [C: 04-2] "Waiting for more info in the task and testing in labs" [puppet] - 10https://gerrit.wikimedia.org/r/459994 (https://phabricator.wikimedia.org/T204074) (owner: 10Elukey) [10:44:43] (03CR) 10jenkins-bot: mariadb: Depool db1110 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459992 (owner: 10Jcrespo) [10:46:20] (03CR) 10Jcrespo: "Small nitpick." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459994 (https://phabricator.wikimedia.org/T204074) (owner: 10Elukey) [10:49:44] !log restart db1110 for upgrade [10:52:17] jynus: Failed to log message to wiki. Somebody should check the error logs. 
[10:54:54] !log restart db1110 for upgrade [10:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:55] (03CR) 10Mathew.onipe: Elasticsearch shard size check (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [10:56:43] !log uploaded poolcounter 1.0.4+deb9u1 to apt.wikimedia.org (T199876) [10:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:50] T199876: Migrate pool counters to stretch - https://phabricator.wikimedia.org/T199876 [10:58:50] (03PS16) 10Mathew.onipe: Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) [10:59:26] 10Operations, 10Maps-Sprint, 10Maps (Tilerator): Log slow queries on - https://phabricator.wikimedia.org/T204106 (10Mathew.onipe) a:03Mathew.onipe [10:59:49] (03CR) 10jerkins-bot: [V: 04-1] Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [11:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180912T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. 
[11:00:06] 10Operations, 10Maps-Sprint, 10Discovery-Search (Current work), 10Maps (Tilerator): Log slow queries on - https://phabricator.wikimedia.org/T204106 (10Mathew.onipe) [11:00:51] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: nova-network: also distribute default route in dnsmasq [puppet] - 10https://gerrit.wikimedia.org/r/459998 (https://phabricator.wikimedia.org/T202636) [11:02:18] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1110 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460001 [11:04:35] (03PS17) 10Mathew.onipe: Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) [11:05:25] o/ [11:05:28] no patches, nice :D [11:05:41] I'm around in case anything needs to be deployed [11:11:46] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1110 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460001 (owner: 10Jcrespo) [11:12:16] zeljkof: I may deploy something [11:12:26] I mean, aside from that ^ [11:12:32] for mediawiki-configuration [11:12:59] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1110 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460001 (owner: 10Jcrespo) [11:13:03] jynus: ok [11:15:02] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1110 with low load (duration: 00m 50s) [11:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:27] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: nova-network: also distribute default route in dnsmasq [puppet] - 10https://gerrit.wikimedia.org/r/459998 (https://phabricator.wikimedia.org/T202636) (owner: 10Arturo Borrero Gonzalez) [11:17:00] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1110 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460001 (owner: 10Jcrespo) [11:17:56] (03PS3) 10Elukey: mariadb::service: remove old require causing issues on Stretch [puppet] - 
10https://gerrit.wikimedia.org/r/459994 (https://phabricator.wikimedia.org/T204074) [11:25:12] 10Operations, 10Analytics, 10hardware-requests: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (10akosiaris) [11:26:03] (03CR) 10Alexandros Kosiaris: [C: 031] Decommission meitnerium (old archiva host) [dns] - 10https://gerrit.wikimedia.org/r/458783 (https://phabricator.wikimedia.org/T203087) (owner: 10Elukey) [11:26:05] (03CR) 10Alexandros Kosiaris: [C: 031] Remove meitnerium (old archiva host) from puppet [puppet] - 10https://gerrit.wikimedia.org/r/458519 (https://phabricator.wikimedia.org/T203087) (owner: 10Elukey) [11:31:06] 10Operations, 10Analytics, 10hardware-requests: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (10akosiaris) The steps listed in the description look correct and sufficient to me. There is one thing to add and it would be the removal from DebMon... [11:49:56] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10daniel) [11:54:04] (03PS5) 10Elukey: Remove meitnerium (old archiva host) from puppet [puppet] - 10https://gerrit.wikimedia.org/r/458519 (https://phabricator.wikimedia.org/T203087) [11:54:07] 10Operations, 10SRE-Access-Requests: Please add everyone on the performance team to perf-roots - https://phabricator.wikimedia.org/T202648 (10Peter) [11:54:11] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: request to add phedenskog to perf-roots - https://phabricator.wikimedia.org/T202658 (10Peter) 05Open>03Resolved Yep it worked, thanks and sorry for the delay. 
[11:54:55] (03CR) 10Elukey: [C: 032] Remove meitnerium (old archiva host) from puppet [puppet] - 10https://gerrit.wikimedia.org/r/458519 (https://phabricator.wikimedia.org/T203087) (owner: 10Elukey) [11:55:23] (03CR) 10Gehel: Elasticsearch shard size check (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [11:58:09] (03CR) 10Gehel: [C: 031] "LGTM (does this require a specific version of the updater to be deployed before merging?)" [puppet] - 10https://gerrit.wikimedia.org/r/459831 (owner: 10Smalyshev) [11:59:05] (03CR) 10Gehel: [C: 031] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/459804 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:59:15] 10Operations, 10Analytics, 10hardware-requests: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (10elukey) ``` elukey@puppetmaster1001:~$ sudo -i puppet node clean meitnerium.wikimedia.org Notice: Revoked certificate with serial 1595 Notice: Remo... [12:00:11] 10Operations, 10Analytics, 10hardware-requests: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (10elukey) ``` elukey@neodymium:~$ sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/meitnerium.wikimedia.org --cert /etc/debmonitor/ssl/ce... 
[12:01:12] elukey: there is a decom script for that ;) [12:01:24] volans: I know I remembered it afterwards :( [12:01:27] (while reading the docs) [12:01:33] lol :) [12:01:48] np, just I was curious if the doc is misleading [12:01:57] given that I guessed you took the curl from there [12:02:45] yep yep [12:02:51] I was already half way through [12:03:05] (03CR) 10Gehel: dnsdisc: improve TTL checks (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/459791 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:03:37] 10Operations, 10Analytics, 10hardware-requests: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (10elukey) ``` elukey@ganeti1001:~$ sudo gnt-instance remove meitnerium.wikimedia.org This will remove the volumes of the instance meitnerium.wikimedi... [12:03:54] elukey: if you have any suggestion to improve it, please feel free to edit it ;) [12:03:59] (03PS2) 10Elukey: Decommission meitnerium (old archiva host) [dns] - 10https://gerrit.wikimedia.org/r/458783 (https://phabricator.wikimedia.org/T203087) [12:04:18] (03CR) 10Elukey: [C: 032] Decommission meitnerium (old archiva host) [dns] - 10https://gerrit.wikimedia.org/r/458783 (https://phabricator.wikimedia.org/T203087) (owner: 10Elukey) [12:04:56] volans: I'll update the part for the ganeti instance, but the docs looks great! 
it is only a matter of reading them before doing things :( [12:05:08] lol [12:05:16] thanks [12:06:24] 10Operations, 10Analytics, 10hardware-requests: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (10elukey) [12:07:11] 10Operations, 10Analytics, 10hardware-requests: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (10elukey) 05Open>03Resolved [12:07:46] !log delete meitnerium.wikimedia.org's ganeti VM (decommissioned) - T203087 [12:07:49] (03PS1) 10Bstorm: wiki replicas: depool labsdb1010 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/460005 (https://phabricator.wikimedia.org/T174047) [12:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:54] T203087: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 [12:09:36] RECOVERY - Filesystem available is greater than filesystem size on ms-be2042 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2042&var-datasource=codfw%2520prometheus%252Fops [12:16:55] (03CR) 10Elukey: [C: 031] "Left some nits, looks good!" 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/459558 (https://phabricator.wikimedia.org/T200312) (owner: 10Muehlenhoff) [12:17:40] !log repair sdm / sdj on ms-be2042 - T199198 [12:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:48] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [12:18:15] (03PS18) 10Mathew.onipe: Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) [12:19:49] (03CR) 10Volans: "reply inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/459791 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:20:52] (03CR) 10Marostegui: [C: 032] wiki replicas: depool labsdb1010 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/460005 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [12:23:05] !log Reload haproxy on dbproxy1010 to depool labsdb1010 - https://phabricator.wikimedia.org/T174047 [12:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:00] (03CR) 10Elukey: "I thought to be able to test this in labs but I forgot that my instance is not in deployment-prep, but in the analytics project, in which " [puppet] - 10https://gerrit.wikimedia.org/r/459994 (https://phabricator.wikimedia.org/T204074) (owner: 10Elukey) [12:24:22] 10Operations, 10Maps, 10Maps-Sprint, 10Reading-Infrastructure-Team-Backlog: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (10Gehel) [12:25:14] (03PS1) 10Gehel: maps: decommission maps-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/460006 (https://phabricator.wikimedia.org/T202898) [12:30:41] (03CR) 10Muehlenhoff: maps: decommission maps-test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/460006 (https://phabricator.wikimedia.org/T202898) (owner: 10Gehel) [12:30:51] (03CR) 10Volans: maps: decommission 
maps-test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/460006 (https://phabricator.wikimedia.org/T202898) (owner: 10Gehel) [12:31:07] moritzm: you beat me :D [12:31:13] same comment :) [12:31:14] volans: by a few seconds only :-) [12:31:56] (03PS2) 10Gehel: maps: decommission maps-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/460006 (https://phabricator.wikimedia.org/T202898) [12:36:46] (03CR) 10Gehel: maps: decommission maps-test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/460006 (https://phabricator.wikimedia.org/T202898) (owner: 10Gehel) [12:37:14] (03CR) 10Muehlenhoff: [C: 031] maps: decommission maps-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/460006 (https://phabricator.wikimedia.org/T202898) (owner: 10Gehel) [12:37:27] 10Operations, 10Maps-Sprint, 10Discovery-Search (Current work), 10Maps (Tilerator): Log slow queries on postgresql / maps - https://phabricator.wikimedia.org/T204106 (10Gehel) [12:37:51] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1110 with original load (duration: 00m 50s) [12:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:37] (03CR) 10Mark Bergsma: "A few remaining minor issues, should be good to go afterwards." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458850 (https://phabricator.wikimedia.org/T201097) (owner: 10Ayounsi) [12:43:46] (03PS1) 10Volans: sre.switchdc.mediawiki: update read only reason [cookbooks] - 10https://gerrit.wikimedia.org/r/460011 (https://phabricator.wikimedia.org/T199079) [12:44:10] (03PS1) 10Jcrespo: CommonSettings.php: Disable translation at centralnotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460012 (https://phabricator.wikimedia.org/T203925) [12:45:50] zeljkof, akosiaris ^ [12:46:15] (03CR) 10Marostegui: [C: 031] CommonSettings.php: Disable translation at centralnotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460012 (https://phabricator.wikimedia.org/T203925) (owner: 10Jcrespo) [12:46:42] jynus: is this FYI or I need to do something? [12:47:39] mostly FYI, but this make break stuff [12:47:48] will test on debug first [12:47:53] *may break [12:50:27] (03CR) 10Alexandros Kosiaris: [C: 031] CommonSettings.php: Disable translation at centralnotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460012 (https://phabricator.wikimedia.org/T203925) (owner: 10Jcrespo) [12:50:49] (03CR) 10Jcrespo: [C: 032] CommonSettings.php: Disable translation at centralnotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460012 (https://phabricator.wikimedia.org/T203925) (owner: 10Jcrespo) [12:50:58] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:51:11] maintenance? 
[12:51:56] (03CR) 10Alexandros Kosiaris: [C: 031] sre.switchdc.mediawiki: update read only reason [cookbooks] - 10https://gerrit.wikimedia.org/r/460011 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:52:05] (03Merged) 10jenkins-bot: CommonSettings.php: Disable translation at centralnotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460012 (https://phabricator.wikimedia.org/T203925) (owner: 10Jcrespo) [12:52:32] no, debmonitor [12:52:34] (03CR) 10jenkins-bot: CommonSettings.php: Disable translation at centralnotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460012 (https://phabricator.wikimedia.org/T203925) (owner: 10Jcrespo) [12:52:34] volans: ^ [12:52:36] jynus: that's harmless [12:52:39] (03CR) 10Gehel: [C: 031] "Looks good to me! Youhouhou!" [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [12:52:41] Session 157272 of user debmonitor [12:52:44] random, known issue in debmonitor [12:52:46] (03PS1) 10Marostegui: Revert "wiki replicas: depool labsdb1010 for view updates" [puppet] - 10https://gerrit.wikimedia.org/r/460013 [12:52:49] ok [12:53:03] the session which does the maintenance cron rarely fails [12:53:05] 10Operations: Add favicon to icinga and tendril - https://phabricator.wikimedia.org/T204110 (10Addshore) [12:53:26] akosiaris@ms-be1036:~$ sudo systemctl reset-failed [12:53:26] https://phabricator.wikimedia.org/T199911 [12:53:30] typically under load [12:53:31] technically is an issue in systemd+cron :D [12:53:34] should recover soon [12:53:34] but yeah known [12:54:18] RECOVERY - Check systemd state on ms-be1036 is OK: OK - running: The system is fully operational [12:54:35] (03CR) 10Volans: [C: 032] sre.switchdc.mediawiki: update read only reason [cookbooks] - 10https://gerrit.wikimedia.org/r/460011 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:55:17] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: update read only reason 
[cookbooks] - 10https://gerrit.wikimedia.org/r/460011 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:55:43] marostegui or someone else: the change is on mwdebug1001, please help me test it [12:55:53] jynus: on my way! [12:58:23] jynus: I am browsing meta and everything looks fine so far [12:58:30] is there any specific thing you want me to test? [12:58:42] the banners seem ok [12:58:52] I am guessing only a fraction uses translation [12:59:13] I checked the main english projects and the main languages [12:59:25] but I cannot check manually the 900 projects, in all conditions [12:59:41] (logged in, logged out, from all contries, etc.) [13:00:35] zeljkof ok to deploy and pray, revert after switch? [13:01:18] jynus: the eswiki banner "más información" link points to Meta:Example [13:01:44] was it different than before? [13:01:59] It is the same when I am not in mwdebug [13:02:06] So I don't think it is related to this change [13:02:27] jynus: I'm not in charge, but my usual policy is to be ready for a revert in case of trouble [13:02:35] zeljkof: of course [13:02:44] I already assume that [13:02:51] https://meta.wikimedia.org/wiki/Special:CentralNoticeBanners/edit/ExpandedMaintenanceNotice_Mobile [13:02:56] [13:03:13] https://meta.wikimedia.org/wiki/Special:CentralNoticeBanners/edit/ExpandedMaintenanceNotice [13:03:16] [13:03:27] yes Dereckson :) [13:03:27] This link isn't a bug, but the current configuration of CentralNotice. [13:03:48] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (watching), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10Mvolz) [13:03:50] as meta admin I can fix that to point to the correct page if you tell me which is [13:04:06] So Dereckson, Hauskatze not related to my patch, right? 
[13:04:19] jynus: I don't think so [13:04:22] jynus: this is Special:CentralNotice [13:04:42] Hauskatze: I'd say https://meta.wikimedia.org/wiki/Tech/Server_switch_2018 [13:05:06] fixing [13:05:24] Yes, that's the best link. Thanks. [13:05:40] I am just disabling some config, not changing content [13:05:51] aka altering the banners themselves [13:06:27] ok, deploying everywhere [13:06:38] fixed on Special:CentralNoticeBanners [13:06:59] Works on meta (en) [13:07:24] but it needs to be updated for every language, doesn't it? [13:07:55] its cached [13:08:02] so yep, it needs to update [13:08:10] !log jynus@deploy1001 Synchronized wmf-config/CommonSettings.php: Disabling CentralNotice translations (duration: 00m 50s) [13:08:14] thankfully there's nothing manual to do for that to happen [13:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:37] Hauskatze: check if some banner is now wrong or has issues [13:08:37] ...and I wanted to fix a crappy translation, no longer :) [13:08:50] I disabled some functionality related to centralnotice [13:09:18] In the meanwhile, as a very ugly solution, I copied the relevant information to Meta:Example. 
[13:09:32] (03PS1) 10Jcrespo: Revert "CommonSettings.php: Disable translation at centralnotice" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460014 [13:10:04] saw that JohanJ [13:10:14] there is high exceptions on commons [13:10:16] I hope the banner config gets refreshed soon [13:11:03] around 120 per minute [13:11:06] !log otto@deploy1001 Started deploy [eventlogging/analytics@5c6fab6]: Support loading plugins in eventlogging-processor - T203596 [13:11:11] !log otto@deploy1001 Finished deploy [eventlogging/analytics@5c6fab6]: Support loading plugins in eventlogging-processor - T203596 (duration: 00m 05s) [13:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:13] T203596: Flip blacklist for MySQL eventlogging consumer to be a whilelist of allowed schemas - https://phabricator.wikimedia.org/T203596 [13:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:20] marostegui, akosiaris revert? [13:11:27] (03CR) 10Gehel: [C: 031] dnsdisc: improve TTL checks (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/459791 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:11:27] one sec [13:12:00] 10Operations, 10Traffic: certcentral: challenge checking on *all* pooled backend hosts - https://phabricator.wikimedia.org/T203396 (10Krenair) [13:12:02] it seems now low [13:12:07] jynus: according to fatals it has recovered now [13:12:12] So let's give it a couple of minutes [13:12:18] (03CR) 10Gehel: [C: 031] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/459805 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:12:20] to see if it gets stable [13:12:30] Transaction spent 5.6564915180206 second(s) in writes, exceeding the limit of 3. 
[13:12:51] (03PS11) 10Ottomata: Whitelist EventLogging schemas for ingestion into MySQL [puppet] - 10https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) [13:12:54] it was api call, not relevant [13:13:00] eswiki link now fixed as well [13:13:08] &action=purge magic [13:13:08] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10tramm) > From the SRE/ops point of view we would like to either completely leave it as it is or completely transfer the domain and delete it from all our confi... [13:13:20] things are now semi-ok [13:13:38] please shout if you see something our of the ordinary, like Hauskatze just did [13:13:44] *out [13:13:57] 10Operations, 10Traffic: gdnsd plugin support for ACME DNS challenges - https://phabricator.wikimedia.org/T194965 (10Krenair) Status: @bblack has written support into gdnsd in https://github.com/gdnsd/gdnsd/commit/db7fff10b005b951890fa4ff7c843a1e37bbdc58 (as well as a follow up or two) and I've made https://ge... [13:14:01] JohanJ: I'll delete the meta:example page [13:14:04] it's fixed [13:14:10] Sounds good. [13:14:22] Although, it did exist before. [13:15:57] So I just reverted it back to what it was. [13:17:17] JohanJ: right, thanks :) [13:17:40] @JohanJ & @Hauskatze apologies for that [13:18:01] (03CR) 10Alex Monk: [WIP] Central certificates service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [13:18:35] just in case: https://meta.wikimedia.org/?diff=18375236 [13:19:04] 10Operations, 10Traffic, 10HTTPS: letsencrypt puppetization: add parallel rsa+ecdsa cert support - https://phabricator.wikimedia.org/T141266 (10Krenair) I don't know if we're going to end up doing this in the current letsencrypt puppetisation, but it's mostly there certcentral. Only thing is my puppetisation... 
[13:22:27] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: letsencrypt puppetization: upgrade for scalability - https://phabricator.wikimedia.org/T134447 (10Krenair) are we going to do this as part of the letsencrypt puppetisation or is this getting made (mostly?) obsolete by certcentral? [13:22:37] (03CR) 10Bstorm: [C: 031] Revert "wiki replicas: depool labsdb1010 for view updates" [puppet] - 10https://gerrit.wikimedia.org/r/460013 (owner: 10Marostegui) [13:23:06] (03CR) 10Marostegui: [C: 032] Revert "wiki replicas: depool labsdb1010 for view updates" [puppet] - 10https://gerrit.wikimedia.org/r/460013 (owner: 10Marostegui) [13:24:20] !log Reload haproxy on dbproxy1010 to repool labsdb1010 - T174047 [13:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:28] T174047: Hide deprecated/unused fields on toolforge replica [MCR] - https://phabricator.wikimedia.org/T174047 [13:24:45] (03CR) 10Ottomata: [C: 032] Whitelist EventLogging schemas for ingestion into MySQL [puppet] - 10https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) (owner: 10Ottomata) [13:24:51] (03PS12) 10Ottomata: Whitelist EventLogging schemas for ingestion into MySQL [puppet] - 10https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) [13:24:55] (03CR) 10Ottomata: [V: 032 C: 032] Whitelist EventLogging schemas for ingestion into MySQL [puppet] - 10https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) (owner: 10Ottomata) [13:27:18] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 660.38 seconds [13:27:29] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 670.89 seconds [13:28:20] ^ we are on that [13:30:33] Informed eswiki VP that the DC switch is happening in 30' aprox. 
[13:30:37] --> nap [13:35:06] (03PS1) 10Bstorm: wiki replicas: depool labsdb1009 to run view updates [puppet] - 10https://gerrit.wikimedia.org/r/460016 (https://phabricator.wikimedia.org/T174047) [13:35:16] 10Operations, 10Maps, 10Maps-Sprint, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (10Gehel) [13:42:12] (03PS1) 10Marostegui: db-codfw.php: Depool db2054 and db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460018 [13:42:48] nl.wp down with 503 for me. [13:43:10] thedj: works for me [13:43:16] If you report this error to the Wikimedia System Administrators, please include the details below. [13:43:18] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:43:19] Request from 212.178.66.253 via cp1081 cp1081, Varnish XID 242386225 [13:43:21] jynus: check the patch above just in case [13:43:22] Error: 503, Backend fetch failed at Wed, 12 Sep 2018 13:42:54 GMT [13:43:24] https://commons.wikimedia.org/wiki/User_talk:DriaThornton [13:43:31] Error: 503, Backend fetch failed at Wed, 12 Sep 2018 13:41:39 GMT [13:43:33] yeah there's been a error spike [13:43:36] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=All&var-status_type=5&from=now-1h&to=now [13:43:49] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:43:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 
[13:44:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:44:33] is this related to the planned maintenance? [13:44:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:44:40] no [13:45:18] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:45:19] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [13:45:19] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:45:48] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:45:48] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [13:45:49] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the 
critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [13:45:58] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:46:39] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:46:57] (03Abandoned) 10Thcipriani: Scap: Update config to use PHP=hhvm -vEval.Jit=1 [puppet] - 10https://gerrit.wikimedia.org/r/459828 (https://phabricator.wikimedia.org/T203680) (owner: 10Thcipriani) [13:46:57] I'm around. is there anything I can check or do? [13:46:59] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:47:04] the whole wiki is down to me [13:48:06] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (10thcipriani) [13:48:09] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Scap should use Eval.Jit=1 when calling rebuildLocalisationCache.php via HHVM - https://phabricator.wikimedia.org/T203680 (10thcipriani) 05Open>03Invalid Will try to move to php7.0 per discussion on T191921 [13:48:31] Request from **** via cp1081 cp1081, Varnish XID 233637540 Error: 503, Backend fetch failed at Wed, 12 Sep 2018 13:46:01 GMT [13:49:02] Amir1: Ops are aware and looking at it :) [13:49:48] kk [13:51:13] "Error: 503, Backend fetch 
failed at Wed, 12 Sep 2018 13:50:52 GMT" [13:51:26] ah k [13:51:50] The mobile site works but strange that the desktop site dosent [13:52:29] i knew those EU laws would be bad, but this is a bit quick to bring down the gauntlet ;) [13:52:50] !log restart varnish-be on cp1081 due to mbox lag and 503s [13:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:39] (03PS1) 10Thcipriani: Scap: Update config to use PHP=php7.0 [puppet] - 10https://gerrit.wikimedia.org/r/460021 (https://phabricator.wikimedia.org/T191921) [13:56:13] seems normal again. Thank you people. Good luck with the switchover ! [13:56:24] fyi, we figured out the cause (misbehaving varnish). Should be fixed now. [13:56:46] cool. thanks akosiaris [13:56:50] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2054 and db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460018 (owner: 10Marostegui) [13:57:48] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:57:58] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:58:09] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:58:11] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2054 and db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460018 (owner: 10Marostegui) [13:58:29] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:58:39] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:58:39] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:58:45] cool, recoveries from icinga as well [13:58:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:59:09] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:59:25] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2054 and db2068 looks like they are stuck (duration: 00m 50s) [13:59:28] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:01] jouncebot: next [14:00:01] In 1 hour(s) and 59 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180912T1600) [14:00:05] Deploy window Datacenter Switchover - MediaWiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180912T1400) [14:00:13] wheeee [14:01:32] nice timing [14:03:20] !log stopping db2068 [14:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:14] !log starting switchover meny "cookbook sre.switchdc.mediawiki eqiad codfw" on sarin - T203777 [14:04:20] *menu [14:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:21] T203777: Successfully switch backend traffic (MediaWiki, Swift, RESTBase, Parsoid and services) to be served from eqiad - https://phabricator.wikimedia.org/T203777 [14:04:32] <_joe_> volans: ahahahah [14:05:44] proceeding with the first preparatory steps [14:05:48] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [14:05:48] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [14:05:49] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 
[14:06:21] !log START - Cookbook sre.switchdc.mediawiki.00-disable-puppet (volans@sarin) [14:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:28] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) (volans@sarin) [14:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:49] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:07:29] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [14:07:42] !log START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (volans@sarin) [14:07:45] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) (volans@sarin) [14:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:00] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10Halfak) With the wikitext slot, we won't know which note relates to which judgement. This is like having one big "not... 
[14:12:45] (CR) jenkins-bot: db-codfw.php: Depool db2054 and db2068 [mediawiki-config] - https://gerrit.wikimedia.org/r/460018 (owner: Marostegui) [14:13:11] !log starting replication on db2068 [14:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:03] Operations, Performance-Team, monitoring, Patch-For-Review: Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (Dzahn) That's partially just our configuration though since we make all SREs a contact for all services. We could as well add proper host groups an... [14:17:02] jynus: did language engineering sign off on that particular way to disable Translate? [14:17:52] if they did, they may've forgotten how CentralNotice is written because from a quick glance (I don't know either of those two code bases well), it seems like this could cause some issues. For one, it means the code no longer recognises its own content. [14:17:57] This page is now 404 - https://meta.wikimedia.org/wiki/CNBanner:WMNL_WomenTechStorm2_2018-text/nb [14:18:10] because page_namespace ID is registered in that conditional. [14:18:29] hey Krinkle, thanks for looking into it! jynus etc. 
are debugging another issue right now [14:19:06] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/460012/ [14:20:48] paravoid: probably same issue [14:21:17] or related I guess [14:28:22] (PS1) Ottomata: Use properties file for RefineMonitor [puppet] - https://gerrit.wikimedia.org/r/460023 (https://phabricator.wikimedia.org/T203804) [14:28:41] ottomata: no puppet merges please, switchdc time ;) [14:29:04] (CR) jerkins-bot: [V: -1] Use properties file for RefineMonitor [puppet] - https://gerrit.wikimedia.org/r/460023 (https://phabricator.wikimedia.org/T203804) (owner: Ottomata) [14:29:17] resuming switchdc [14:29:19] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 0.11 seconds [14:29:29] !log START - Cookbook sre.switchdc.mediawiki.00-wipe-and-warmup-caches (volans@sarin) [14:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:36] volans: no worries, mostly putting that there to run compiler [14:31:47] great! 
thx [14:32:04] (03PS1) 10Marostegui: db-codfw.php: Repool db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460024 [14:32:26] (03CR) 10Marostegui: [C: 04-1] "Only repool if needed - this host might need recloning" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460024 (owner: 10Marostegui) [14:33:13] (03PS2) 10Ottomata: Use properties file for RefineMonitor [puppet] - 10https://gerrit.wikimedia.org/r/460023 (https://phabricator.wikimedia.org/T203804) [14:33:29] PROBLEM - HHVM rendering on mw2174 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:30] PROBLEM - HHVM rendering on mw2220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:30] PROBLEM - HHVM rendering on mw2210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:48] somewhat expected, they didn't alert the other day during the live-test though [14:33:51] (03CR) 10jerkins-bot: [V: 04-1] Use properties file for RefineMonitor [puppet] - 10https://gerrit.wikimedia.org/r/460023 (https://phabricator.wikimedia.org/T203804) (owner: 10Ottomata) [14:34:02] !log stopping mariadb at db2054 [14:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:29] RECOVERY - HHVM rendering on mw2174 is OK: HTTP OK: HTTP/1.1 200 OK - 75336 bytes in 0.328 second response time [14:34:29] RECOVERY - HHVM rendering on mw2220 is OK: HTTP OK: HTTP/1.1 200 OK - 75336 bytes in 0.400 second response time [14:34:29] RECOVERY - HHVM rendering on mw2210 is OK: HTTP OK: HTTP/1.1 200 OK - 75336 bytes in 0.436 second response time [14:34:32] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10Halfak) For clarity, here's a rough version of the endorsements proposal that I'd originally put together about a year... 
[14:34:59] PROBLEM - HHVM rendering on mw2202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:34:59] PROBLEM - HHVM rendering on mw2251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:34:59] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 29 probes of 318 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [14:35:51] they are recovering fast though, so probably ok [14:35:58] RECOVERY - HHVM rendering on mw2202 is OK: HTTP OK: HTTP/1.1 200 OK - 75336 bytes in 0.383 second response time [14:35:59] RECOVERY - HHVM rendering on mw2251 is OK: HTTP OK: HTTP/1.1 200 OK - 75336 bytes in 0.426 second response time [14:36:01] (03PS3) 10Ottomata: Use properties file for RefineMonitor [puppet] - 10https://gerrit.wikimedia.org/r/460023 (https://phabricator.wikimedia.org/T203804) [14:36:29] PROBLEM - HHVM rendering on mw2222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:36:37] (03CR) 10jerkins-bot: [V: 04-1] Use properties file for RefineMonitor [puppet] - 10https://gerrit.wikimedia.org/r/460023 (https://phabricator.wikimedia.org/T203804) (owner: 10Ottomata) [14:37:29] RECOVERY - HHVM rendering on mw2222 is OK: HTTP OK: HTTP/1.1 200 OK - 75336 bytes in 0.432 second response time [14:39:02] (03PS4) 10Ottomata: Use properties file for RefineMonitor [puppet] - 10https://gerrit.wikimedia.org/r/460023 (https://phabricator.wikimedia.org/T203804) [14:40:12] (03PS5) 10Ottomata: Use properties file for RefineMonitor [puppet] - 10https://gerrit.wikimedia.org/r/460023 (https://phabricator.wikimedia.org/T203804) [14:40:32] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-wipe-and-warmup-caches (exit_code=0) (volans@sarin) [14:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:27] !log START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (volans@sarin) [14:41:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:35] (03PS6) 10Ottomata: Use properties file for RefineMonitor [puppet] - 10https://gerrit.wikimedia.org/r/460023 (https://phabricator.wikimedia.org/T203804) [14:41:42] !log END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) (volans@sarin) [14:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:26] (03PS2) 10Alexandros Kosiaris: cache::text: switch mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/458772 (https://phabricator.wikimedia.org/T203776) [14:42:35] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] cache::text: switch mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/458772 (https://phabricator.wikimedia.org/T203776) (owner: 10Alexandros Kosiaris) [14:44:05] about to go read-only [14:44:14] !log START - Cookbook sre.switchdc.mediawiki.02-set-readonly (volans@sarin) [14:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:24] !log MediaWiki read-only period starts at: 2018-09-12 14:44:24.536913 (volans@sarin) [14:44:25] !log END (FAIL) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=99) (volans@sarin) [14:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:48] all seems good [14:44:50] although is failed [14:44:53] <_joe_> volans: ? [14:45:02] one host didn't have the KeyError: 'readonlyreason' [14:45:14] <_joe_> one host?? [14:45:16] et me run it again [14:45:17] <_joe_> what do you mean? [14:45:20] <_joe_> oh I see now [14:45:21] it's idempotent [14:45:23] <_joe_> ok leave it [14:45:25] while checking [14:45:40] etcd is correct [14:45:45] akosiaris: ok to proceed? 
[14:46:02] <_joe_> wikis are in read only from my tests [14:46:03] enwiki confirmed on read only [14:46:05] <_joe_> let's proceed [14:46:08] <_joe_> itwiki too [14:46:14] good [14:46:22] dbs now? [14:46:26] yes [14:46:33] go for it [14:46:35] !log START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (volans@sarin) [14:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:43] s1 eqiad enwiki master confirmed read only on mysql level [14:47:01] checking them in sync [14:47:03] !log END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) (volans@sarin) [14:47:05] all good [14:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:09] let's wait for the confirmation ^there [14:47:17] ready to switch mediawiki [14:47:22] go for it [14:47:24] <_joe_> let's proceed [14:47:28] go [14:47:30] !log START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (volans@sarin) [14:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:51] !log END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) (volans@sarin) [14:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:02] ready to switch traffic [14:48:03] <_joe_> let's switch traffic now [14:48:07] !log START - Cookbook sre.switchdc.mediawiki.04-switch-traffic (volans@sarin) [14:48:08] +1 [14:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:27] * volans waiting for puppet [14:48:54] almost no read only log errors, which is good [14:48:57] <_joe_> yes, this is the longest phase of the switchover [14:49:02] like 15 seconds [14:49:06] *15 logs [14:49:07] who is running puppet? [14:49:10] expect ~3min [14:49:12] switchdc is [14:49:13] <_joe_> cumin [14:49:14] the script [14:49:15] ok [14:49:16] <_joe_> :P [14:49:23] 4 answers, 4 different answers ... 
[14:49:25] <_joe_> ahahah we gave 3 different answers, all correct [14:49:29] haha I was about to comment on that [14:49:32] just 1 silly question [14:49:44] noone responded "spicerack" though [14:49:51] or 'volans' [14:49:51] you just did [14:49:56] I guess I did! [14:50:14] <_joe_> mark: volans is running the script, yes, but we consider it a "what", not a "who" [14:50:14] running puppet in eqiad [14:50:18] <_joe_> he's clearly a linter bot [14:50:20] lol [14:50:21] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T202705 (10Cmjohnson) Swapped the disk in sodium return shipping info USPS 9202 3946 5301 2439 6565 62 FEDX 9611918 2393026 76406583 [14:50:23] volans's answer was the most surprising one [14:50:24] !log END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-traffic (exit_code=0) (volans@sarin) [14:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:28] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 24 probes of 318 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [14:50:31] marostegui: check for overloads [14:50:31] ready for redis [14:50:34] ok traffic switched [14:50:36] <_joe_> go for redis [14:50:39] go for redis [14:50:41] <_joe_> it makes sense to do it now [14:50:44] !log START - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (volans@sarin) [14:50:47] !log END (PASS) - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (exit_code=0) (volans@sarin) [14:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:48] <_joe_> even if we want to verify traffic [14:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:58] ready for 06-set-db-readwrite [14:51:05] (03PS7) 10Ottomata: Use properties file for RefineMonitor [puppet] - 10https://gerrit.wikimedia.org/r/460023 
(https://phabricator.wikimedia.org/T203804) [14:51:12] some db errors, but not many so far [14:51:20] so far we are fine [14:51:24] and going down [14:51:24] akosiaris: ? [14:51:26] volans: ok, go [14:51:27] <_joe_> volans: green [14:51:30] !log START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (volans@sarin) [14:51:32] !log END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) (volans@sarin) [14:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:35] s1 codfw master read only off on mysql level, confirmed [14:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:43] ready for 07-set-readwrite [14:51:45] and now mediawiki. go [14:51:49] !log START - Cookbook sre.switchdc.mediawiki.07-set-readwrite (volans@sarin) [14:51:50] +1 [14:51:51] <_joe_> go with mediawiki [14:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:59] !log MediaWiki read-only period ends at: 2018-09-12 14:51:58.936291 (volans@sarin) [14:51:59] !log END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) (volans@sarin) [14:52:01] errors on s1 mostly, not on s7 [14:52:03] done [14:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:09] (But still low) [14:52:12] I can edit [14:52:16] ok we are out of the ro [14:52:26] enwiki taking long for me to save an edit [14:52:28] but it did [14:52:36] cool! 
[14:52:41] yes pretty slow [14:52:41] some lag complains [14:52:43] have been trailing 5xx.json on oxygen, log output went really up during the switch and now normal again [14:53:05] but it was a spike getting behind [14:53:09] eswiki works fine saving an edit [14:53:22] s7 is doing fine so far [14:53:26] VE edit on dewiki workded fine [14:53:27] performance issues we should be checking [14:53:28] RO: 14:51:58.936291 - 14:44:24.536913 [14:53:43] that is probably the higest concern for me [14:53:52] el wiki worked fine [14:53:56] volans: that doesn't sound right [14:54:03] END - START [14:54:08] was to calculate the DELTA [14:54:09] :D [14:54:16] heh [14:54:23] (03PS8) 10Ottomata: Use properties file for RefineMonitor [puppet] - 10https://gerrit.wikimedia.org/r/460023 (https://phabricator.wikimedia.org/T203804) [14:54:27] a spike of job errors [14:54:38] maybe too much concurrency at the beginning? [14:54:48] PROBLEM - Check health of redis instance on 6380 on rdb1001 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 3240 keys, up 181 days 4 hours [14:55:09] ^? 
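Annotation on the "RO: 14:51:58.936291 - 14:44:24.536913" message above: the END - START delta can be worked out from the two read-only timestamps logged by the cookbook. The following is a minimal sketch, not part of the switchdc tooling; the function name `read_only_duration` and the format constant are hypothetical, and only the two timestamps come from the log.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the "MediaWiki read-only period starts/ends at" log lines.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def read_only_duration(start: str, end: str) -> timedelta:
    """Return END - START for two timestamps in the log's format (hypothetical helper)."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    ended = datetime.strptime(end, LOG_TIME_FORMAT)
    return ended - started


# The two values below are copied from the !log entries at 14:44:24 and 14:51:59.
duration = read_only_duration(
    "2018-09-12 14:44:24.536913",
    "2018-09-12 14:51:58.936291",
)
print(duration)  # 0:07:34.399378
```

Under these assumptions the read-only window comes out at roughly seven and a half minutes.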
[14:55:18] jynus: 17:54 < _joe_> oh we have a bunch of redis alarms we should all disregard [14:55:18] PROBLEM - Check health of redis instance on 6381 on rdb1001 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3463 keys, up 181 days 4 hours [14:55:19] PROBLEM - Check health of redis instance on 6378 on rdb1001 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 15 keys, up 181 days 4 hours [14:55:22] paravoid: thanks [14:55:25] didn't saw that [14:55:28] PROBLEM - Check health of redis instance on 6379 on rdb1001 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4705935 keys, up 181 days 4 hours [14:55:32] (03CR) 10jerkins-bot: [V: 04-1] Use properties file for RefineMonitor [puppet] - 10https://gerrit.wikimedia.org/r/460023 (https://phabricator.wikimedia.org/T203804) (owner: 10Ottomata) [14:55:35] np :) [14:55:53] (03PS9) 10Ottomata: Use properties file for RefineMonitor [puppet] - 10https://gerrit.wikimedia.org/r/460023 (https://phabricator.wikimedia.org/T203804) [14:56:00] quite a lot of errors due to lag on jobs [14:56:07] _joe_: those errors on eqiad redises? [14:56:11] yeah it's fine. 
I was thinking about downtiming them and totally forgot [14:56:14] <_joe_> volans: just ignore them [14:56:19] PROBLEM - Check health of redis instance on 6378 on rdb1003 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 4705606 keys, up 181 days 3 hours [14:56:19] PROBLEM - Check health of redis instance on 6378 on rdb1007 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 4 keys, up 181 days 2 hours [14:56:19] PROBLEM - Check health of redis instance on 6379 on rdb1007 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 3186 keys, up 181 days 2 hours [14:56:19] PROBLEM - Check health of redis instance on 6381 on rdb1003 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4604373 keys, up 181 days 3 hours [14:56:19] PROBLEM - Check health of redis instance on 6380 on rdb1003 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4705813 keys, up 181 days 3 hours [14:56:20] ok [14:56:27] let's add them to the ones to be silenced [14:56:29] PROBLEM - Check health of redis instance on 6381 on rdb1007 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 2865 keys, up 181 days 2 hours [14:56:29] PROBLEM - Check health of redis instance on 6379 on rdb1003 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4705883 keys, up 181 days 3 hours [14:56:37] we may have a lot of concurrency on jobs [14:56:46] and may need to tune that down next time [14:57:01] as we suffer due to those delayed ones + cold dbs [14:57:08] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above 
the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [14:57:32] <_joe_> jynus: should we ask mobrovac to tune it down now? [14:57:36] that's ^ due to the spike [14:57:45] it's already fine [14:57:55] let me know when we're confortable with the other steps [14:58:07] jynus: do you want me to update tendril? [14:58:12] I guess it might be useful [14:58:16] it actually took it 3 mins after the event subsided to alert [14:58:17] need help? [14:58:18] bad alert [14:58:22] to have it earlier than later [14:58:42] volans: yeah, go for the tendril update [14:58:46] volans: please do [14:58:59] !log START - Cookbook sre.switchdc.mediawiki.08-update-tendril (volans@sarin) [14:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:14] !log END (PASS) - Cookbook sre.switchdc.mediawiki.08-update-tendril (exit_code=0) (volans@sarin) [14:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:19] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [14:59:19] so mostly db errors with organic weight- mostly enwiki [14:59:21] done [14:59:33] jynus: yeah, and no lag apparently [14:59:50] yeah, but too much concurrency on masters from job queue on start [14:59:54] we need to tune that down [15:00:17] (03PS10) 10Ottomata: Use properties file for RefineMonitor [puppet] - 10https://gerrit.wikimedia.org/r/460023 (https://phabricator.wikimedia.org/T203804) [15:00:21] reduce rate for switch even if it creates processing lag [15:02:00] (03CR) 10Ottomata: "Ok, ready for review. First refinery 0.0.74 must be deployed (with RefineMonitor changes), and this should work! 
Gonna be a little messy" [puppet] - 10https://gerrit.wikimedia.org/r/460023 (https://phabricator.wikimedia.org/T203804) (owner: 10Ottomata) [15:02:42] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 0.51 seconds [15:02:52] akosiaris: are we ok with the rest? [15:02:57] (03PS1) 10EBernhardson: Cirrus default cluster should be "local" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460028 [15:03:03] maintenance and parsoid [15:03:08] ^that is the db2054 recovery [15:03:21] volans: wanna do 08-restore-ttl ? [15:03:26] let's leave parsoid for last [15:03:27] <_joe_> prioritize maintenance, please [15:03:35] ok [15:03:38] that too [15:03:38] ok 08-start-maintenance [15:03:42] !log START - Cookbook sre.switchdc.mediawiki.08-start-maintenance (volans@sarin) [15:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:56] (03CR) 10DCausse: [C: 031] Cirrus default cluster should be "local" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460028 (owner: 10EBernhardson) [15:04:08] hi [15:04:16] hello [15:04:18] Was the earlier technical issue resolved? [15:04:28] which one of all ? [15:04:28] which earlier technical issue? [15:04:30] :) [15:04:43] ShakespeareFan00: we are in the middle of a switchover [15:04:49] From? [15:04:53] so pretty much anything you saw was expected [15:04:59] eqiad to codfw [15:04:59] (So planned upgrades?) 
[15:04:59] ShakespeareFan00, they're doing a datacentre switchover [15:05:00] !log END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) (volans@sarin) [15:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:05] Krenair: Ah [15:05:06] please read the central notice [15:05:08] they're kind of busy [15:05:10] ShakespeareFan00, https://meta.wikimedia.org/wiki/Tech/Server_switch_2018 [15:05:17] So not a time to do any massive edit runs [15:05:25] take chatter to -tech [15:05:26] ShakespeareFan00: come to #wikimedia-tech for discussion about it [15:05:28] definitely not ;) [15:05:36] ShakespeareFan00: no [15:05:47] !log START - Cookbook sre.switchdc.mediawiki.08-restore-ttl (volans@sarin) [15:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:57] !log END (PASS) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=0) (volans@sarin) [15:05:57] Krenait: Noted and sorry [15:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:25] teh icinga warnings for confd whould auto-resolve now [15:06:28] Wikidata dispatching just started back up [15:06:42] and appears to be running just fine :) [15:06:51] akosiaris, _joe_, mobrovac: only parsoid restart left, is it still needed? [15:06:51] cool [15:07:16] let's go ahead with it volans, please [15:07:26] volans: you head the man :-) [15:07:28] <_joe_> volans: yes it is, but lemme check 1 thing please [15:07:29] heard* [15:07:29] i.e. yes, restart :) [15:07:36] _joe_: ack waiting [15:07:46] that's the thing that takes ~45mins, right ? 
[15:07:54] more ~15 [15:07:59] but yes is slow [15:08:03] <_joe_> go on volans [15:08:05] heh, ok [15:08:06] ack [15:08:06] jouncebot, next [15:08:06] In 0 hour(s) and 51 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180912T1600) [15:08:15] !log START - Cookbook sre.switchdc.mediawiki.08-restart-parsoid (volans@sarin) [15:08:16] <_joe_> akosiaris: it's completely harmless, btw [15:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:23] <_joe_> the parsoid restart [15:08:25] it's a batch 1 restart [15:08:28] obviously [15:08:30] ok [15:08:33] <_joe_> it will just unbreak wikitech AFAICS [15:09:02] will ? why ? [15:09:29] there's some bug relating to parsoid's config for wikitech during the switchover [15:09:48] https://phabricator.wikimedia.org/T163438 [15:09:53] sidenote: are we collecting bugs anywhere? [15:10:07] Krenair: yeah I know about that task, I am not sure why yet though [15:10:08] <_joe_> we have a tag, right? 
[15:10:11] https://phabricator.wikimedia.org/project/view/3571/ has two open ones [15:10:14] (and other unforeseen issues) [15:10:19] paravoid: ^ [15:10:36] the one with the nice server icon :P [15:10:38] (PS1) Marostegui: db-codfw.php: Tweak weights in s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/460030 [15:10:47] alright, fair enough [15:11:00] <_joe_> akosiaris: https://phabricator.wikimedia.org/T163438#4576283 [15:11:15] yup, reading it now [15:11:22] PROBLEM - High CPU load on API appserver on mw2141 is CRITICAL: CRITICAL - load average: 57.16, 44.03, 29.59 [15:11:28] <_joe_> oh nice of course [15:11:38] dammit [15:11:51] PROBLEM - High CPU load on API appserver on mw2138 is CRITICAL: CRITICAL - load average: 60.76, 40.81, 26.21 [15:11:52] tagged it #operations [15:11:59] <_joe_> this seems serious [15:12:01] Operations, Cloud-Services, Parsing-Team, Datacenter-Switchover-2018, and 2 others: VisualEditor broken on wikitech when codfw is primary: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500." 
- https://phabricator.wikimedia.org/T163438 (10faidon) [15:12:17] (03PS1) 10Jcrespo: mariadb: Distribute better weight among s1 servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460031 [15:12:25] (03CR) 10Marostegui: [C: 032] db-codfw.php: Tweak weights in s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460030 (owner: 10Marostegui) [15:12:32] (03CR) 10Jcrespo: [C: 031] db-codfw.php: Tweak weights in s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460030 (owner: 10Marostegui) [15:12:42] jynus: I can merge your change too and deploy at the same time [15:12:43] api [15:13:03] <_joe_> akosiaris: it's just 3 servers [15:13:04] multiple api servers are >=7% CPU [15:13:04] (03CR) 10Marostegui: [C: 031] mariadb: Distribute better weight among s1 servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460031 (owner: 10Jcrespo) [15:13:28] <_joe_> and some of the oldest ones [15:13:29] mw2141, mw2142, mw2138, mw2145 [15:13:32] (03CR) 10Jcrespo: [C: 032] mariadb: Distribute better weight among s1 servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460031 (owner: 10Jcrespo) [15:13:36] <_joe_> I suggest lowering their weight in pybal [15:13:47] jynus: I will deploy your change [15:13:48] (03Merged) 10jenkins-bot: db-codfw.php: Tweak weights in s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460030 (owner: 10Marostegui) [15:13:54] <_joe_> akosiaris: should I do it? 
[15:13:59] _joe_: ok, if this isn't a stuck HHVM, fine by me [15:14:00] (03PS2) 10Jcrespo: mariadb: Distribute better weight among s1 servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460031 [15:14:04] sure go ahead [15:14:09] marostegui: it is rebasing [15:14:12] <_joe_> akosiaris: I'll check one for being sure [15:14:23] marostegui: db2057 issues [15:14:25] yeah, I will wait for the merge and then deploy you and mine at the same time [15:14:34] <_joe_> !log depooling mw2141 for investigation [15:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:58] is it the s3 master? [15:15:02] jynus: looks like it has recovered, it is a slave [15:15:12] <_joe_> akosiaris: yes, confirmed, just load [15:15:13] weird [15:15:26] <_joe_> !log repool mw2141 [15:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:30] parsoid restart half way through [15:16:32] RECOVERY - High CPU load on API appserver on mw2141 is OK: OK - load average: 13.41, 31.91, 29.26 [15:17:16] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Tweak weights in s1 and s2 (duration: 00m 50s) [15:17:21] jynus: ^ [15:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:38] Any reason not to take down the banner now? [15:18:42] PROBLEM - High CPU load on API appserver on mw2145 is CRITICAL: CRITICAL - load average: 71.73, 50.13, 35.72 [15:18:45] (CentralNotice.) [15:19:23] akosiaris ^^^ I think we can take it down [15:19:28] !log oblivian@sarin conftool action : set/weight=20; selector: dc=codfw,cluster=api_appserver,service=apache2,name=mw22[2-9].* [15:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:06] JohanJ: em yeah. 
wait a bit we are reverting the translation disabling we did just prior to the switchover [15:20:12] will take a couple of mins [15:20:12] <_joe_> this should probably solve the load issues on codfw [15:20:14] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10Cmjohnson) [15:20:19] <_joe_> APIs [15:20:29] akosiaris: OK. [15:21:14] (03CR) 10Elukey: [C: 031] Use properties file for RefineMonitor [puppet] - 10https://gerrit.wikimedia.org/r/460023 (https://phabricator.wikimedia.org/T203804) (owner: 10Ottomata) [15:21:21] (03CR) 10Alexandros Kosiaris: [C: 031] Revert "CommonSettings.php: Disable translation at centralnotice" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460014 (owner: 10Jcrespo) [15:22:09] !log oblivian@sarin conftool action : set/weight=15; selector: dc=codfw,cluster=api_appserver,service=apache2,name=mw21.* [15:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:51] PROBLEM - HHVM rendering on mw2137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:42] RECOVERY - HHVM rendering on mw2137 is OK: HTTP OK: HTTP/1.1 200 OK - 75346 bytes in 0.294 second response time [15:24:29] (03PS3) 10Alexandros Kosiaris: db: Switch dns master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/458787 (https://phabricator.wikimedia.org/T203776) [15:24:42] !log END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-parsoid (exit_code=0) (volans@sarin) [15:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:13] (03PS1) 10Arlolra: Set $wgSiteMatrixNonGlobalSites global for SiteMatrix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460034 [15:25:13] mobrovac: got 2 warnings while restarting parsoid, on wtp1026: [15:25:14] Warning: error while checking the pooled state on 208.80.154.139:9090: Net::ReadTimeout [15:25:22] and on wtp2007.codfw.wmnet: 
Warning: error while checking the pooled state on 10.192.17.6:9090: Net::ReadTimeout [15:25:32] but the restart exited with 0 exit code [15:25:40] that's conftool that's busy it seems [15:25:42] _joe_: ^ ? [15:26:02] (03PS1) 10RobH: set policy.wikimedia.org check to LE threshholds [puppet] - 10https://gerrit.wikimedia.org/r/460035 (https://phabricator.wikimedia.org/T201695) [15:26:22] otherwise, that's ok volans, these should be automatically repooled by the pooling check [15:26:52] mobrovac: ack, so no manual intervention needed [15:26:53] thanks! [15:26:57] (03CR) 10Jcrespo: [C: 032] Revert "CommonSettings.php: Disable translation at centralnotice" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460014 (owner: 10Jcrespo) [15:26:58] Is there anything goign on where I cannot merge a monitoring update? [15:27:13] just changing policy.w.o check to LE interval. [15:27:27] robh: no, I think it's fine [15:27:41] (03CR) 10RobH: [C: 032] set policy.wikimedia.org check to LE threshholds [puppet] - 10https://gerrit.wikimedia.org/r/460035 (https://phabricator.wikimedia.org/T201695) (owner: 10RobH) [15:27:43] (03CR) 10jenkins-bot: db-codfw.php: Tweak weights in s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460030 (owner: 10Marostegui) [15:27:45] (03CR) 10jenkins-bot: mariadb: Distribute better weight among s1 servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460031 (owner: 10Jcrespo) [15:27:56] akosiaris: thx! 
[15:28:12] (03Merged) 10jenkins-bot: Revert "CommonSettings.php: Disable translation at centralnotice" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460014 (owner: 10Jcrespo) [15:28:17] (03PS1) 10Marostegui: db-codfw.php: Change weights in s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460036 [15:28:21] <_joe_> mobrovac: that's pybal, not conftool [15:28:25] (03CR) 10jenkins-bot: Revert "CommonSettings.php: Disable translation at centralnotice" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460014 (owner: 10Jcrespo) [15:28:47] mutante: good catch about policy.w.o =] fixing [15:29:03] robh: :) [15:29:30] added you to reviewers not so much for the change, its easy command line [15:29:37] but just so you were aware i saw your find and fixed! [15:30:06] this is not urgent at all, so let's talk about it later please [15:30:12] sorry. [15:30:14] just trying to minimize the chatter :) [15:30:26] (03CR) 10Marostegui: [C: 032] db-codfw.php: Change weights in s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460036 (owner: 10Marostegui) [15:31:42] (03Merged) 10jenkins-bot: db-codfw.php: Change weights in s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460036 (owner: 10Marostegui) [15:31:44] (03CR) 10Marostegui: [V: 032 C: 032] db-codfw.php: Change weights in s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460036 (owner: 10Marostegui) [15:31:52] (03PS1) 10Cmjohnson: adding mgmt dns for new host stat1007 [dns] - 10https://gerrit.wikimedia.org/r/460037 (https://phabricator.wikimedia.org/T203852) [15:32:27] 10Operations, 10ops-codfw, 10fundraising-tech-ops: move/setup/install frauth2001.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (10Papaul) [15:32:57] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Decrease weight for db2070 (duration: 00m 49s) [15:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:47] RECOVERY - High CPU load on API appserver on mw2145 is OK: OK 
- load average: 19.21, 22.75, 29.54 [15:35:16] RECOVERY - Host backup2001 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [15:37:24] (03PS1) 10Papaul: DNS: Add mgmt DNS for frauth2001 and remove old asset tag entries [dns] - 10https://gerrit.wikimedia.org/r/460042 (https://phabricator.wikimedia.org/T204079) [15:38:23] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: 43b98e414 (duration: 00m 50s) [15:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:37] volans: fyi, i repooled the two wtp hosts manually [15:40:38] mobrovac: ack, thanks, so they needed some manual intervention in the end :( [15:40:47] (03CR) 10Cmjohnson: [C: 032] adding mgmt dns for new host stat1007 [dns] - 10https://gerrit.wikimedia.org/r/460037 (https://phabricator.wikimedia.org/T203852) (owner: 10Cmjohnson) [15:41:25] RECOVERY - High CPU load on API appserver on mw2138 is OK: OK - load average: 16.54, 17.96, 29.47 [15:42:46] (03CR) 10jenkins-bot: db-codfw.php: Change weights in s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460036 (owner: 10Marostegui) [15:44:23] (03PS4) 10Ayounsi: Icinga: add check_vcp (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/458850 (https://phabricator.wikimedia.org/T201097) [15:44:25] (03CR) 10Ayounsi: Icinga: add check_vcp (part 1) (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458850 (https://phabricator.wikimedia.org/T201097) (owner: 10Ayounsi) [15:45:44] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10Cmjohnson) [15:46:13] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10Cmjohnson) a:05Cmjohnson>03RobH @robh all yours to finish installations [15:48:01] (03PS1) 10Marostegui: db-codfw.php: Decrease weight for db2049 [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/460043 [15:49:08] (03CR) 10Jcrespo: [C: 032] db-codfw.php: Decrease weight for db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460043 (owner: 10Marostegui) [15:49:27] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407 (10Cmjohnson) [15:49:30] 10Operations, 10ops-eqiad: rename/reimage labnodepool1002.eqiad.wmnet as cloudservices1003.wikimedia.org - https://phabricator.wikimedia.org/T201439 (10Cmjohnson) 05Open>03Resolved label changed, racktables updated [15:50:20] (03Merged) 10jenkins-bot: db-codfw.php: Decrease weight for db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460043 (owner: 10Marostegui) [15:51:19] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Decrease weight for db2049 (duration: 00m 50s) [15:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:42] 10Operations, 10ops-eqiad, 10DC-Ops: Rename labvirt1019 and cloudvirt1020 to cloudvirt1019 and cloudvirt1020 - https://phabricator.wikimedia.org/T204004 (10Cmjohnson) 05Open>03Resolved updated labels and racktables [15:53:42] (03PS4) 10Alexandros Kosiaris: db: Switch dns master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/458787 (https://phabricator.wikimedia.org/T203776) [15:53:54] (03CR) 10Alexandros Kosiaris: [C: 032] db: Switch dns master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/458787 (https://phabricator.wikimedia.org/T203776) (owner: 10Alexandros Kosiaris) [15:54:43] the notice about the maintenance could be removed now, isn't? [15:55:15] JohanJ: we 've reenabled the translation for CN. The maint banner can be removed. Thanks! [15:55:19] yannf: thanks for reminding me [15:55:26] Thanks. 
[15:55:30] ok ;) [15:55:42] jouncebot, next [15:55:43] In 0 hour(s) and 4 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180912T1600) [15:55:57] are ops ready for that? [15:56:06] !log switch s*-master DNS records to codfw [15:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:18] oh doesn't look like anyone has listed anything for it anyway [15:56:19] Krenair: there is nothing in that window [15:56:26] neither the evening one [15:56:48] 10Operations, 10DC-Ops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475 (10Cmjohnson) [15:56:51] * gehel can add stuff to that window if needed :) [15:56:53] 10Operations, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad unresponsive - https://phabricator.wikimedia.org/T175625 (10Cmjohnson) 05Open>03Resolved All serial connections have been fixed to be a standard pin-out [15:57:07] (03CR) 10jenkins-bot: db-codfw.php: Decrease weight for db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460043 (owner: 10Marostegui) [15:57:08] dcausse: feel free to switch elastisearch to codfw when you want [15:57:15] just give me a ping beforehand [16:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Morning SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180912T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
[16:01:17] (03PS1) 10Jcrespo: mariadb: Reduce db2065 database weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460045 [16:02:02] (03CR) 10Marostegui: [C: 031] mariadb: Reduce db2065 database weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460045 (owner: 10Jcrespo) [16:03:28] (03CR) 10Jcrespo: [C: 032] mariadb: Reduce db2065 database weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460045 (owner: 10Jcrespo) [16:04:48] (03Merged) 10jenkins-bot: mariadb: Reduce db2065 database weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460045 (owner: 10Jcrespo) [16:06:33] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Reduce db2056 weight (duration: 00m 49s) [16:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:46] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:08:47] (03PS1) 10Marostegui: db-codfw.php: Decrease weight for db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460046 [16:09:05] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:10:06] (03CR) 10Marostegui: [C: 032] db-codfw.php: Decrease weight for db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460046 (owner: 10Marostegui) [16:10:55] (03CR) 10jerkins-bot: [V: 04-1] db-codfw.php: Decrease weight for db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460046 (owner: 10Marostegui) [16:11:37] (03CR) 10Marostegui: [C: 032] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460046 (owner: 10Marostegui) [16:11:39] (03CR) 10Banyek: [C: 031] db-codfw.php: Decrease weight for db2049 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/460046 (owner: 10Marostegui) [16:12:12] (03CR) 10jenkins-bot: mariadb: Reduce db2065 database weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460045 (owner: 10Jcrespo) [16:13:09] (03Merged) 10jenkins-bot: db-codfw.php: Decrease weight for db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460046 (owner: 10Marostegui) [16:13:59] 10Operations, 10Puppet, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q1): exported puppet resources are not queryable: cannot create grafana graphs of EventLogging running in beta cluster - https://phabricator.wikimedia.org/T204088 (10Jdlrobson) [16:14:05] 10Operations, 10ops-eqdfw: unrack/decom cr1-eqdfw - https://phabricator.wikimedia.org/T202700 (10Papaul) [16:14:07] 10Operations, 10ops-eqiad, 10DC-Ops: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10Cmjohnson) 05Open>03Resolved I do not see any raid alerts in icinga...resolving [16:14:12] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Decrease weight for db2049 (duration: 00m 49s) [16:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:10] 10Operations, 10ops-eqiad, 10DC-Ops: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10Dzahn) What about the revert? was it done? It doesn't look like it. Creating it. 
[16:18:17] (03PS1) 10Dzahn: Revert "sre.switchdc.mediawiki: parsoid skip broken host" [cookbooks] - 10https://gerrit.wikimedia.org/r/460049 [16:18:58] (03CR) 10Dzahn: "i just created this by clicking in Gerrit based on the comment https://phabricator.wikimedia.org/T196886#4573502 and because the ticket w" [cookbooks] - 10https://gerrit.wikimedia.org/r/460049 (owner: 10Dzahn) [16:19:14] 10Operations, 10ops-eqdfw: unrack/decom cr1-eqdfw - https://phabricator.wikimedia.org/T202700 (10Papaul) [16:19:30] 10Operations, 10ops-eqdfw: unrack/decom cr1-eqdfw - https://phabricator.wikimedia.org/T202700 (10Papaul) 05Open>03Resolved [16:19:49] 10Operations, 10ops-eqiad, 10DC-Ops: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/operations/cookbooks/+/460049/ [16:20:02] (03PS1) 10Jcrespo: mariadb: Decrease db2065 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460050 [16:20:27] (03CR) 10Marostegui: [C: 031] mariadb: Decrease db2065 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460050 (owner: 10Jcrespo) [16:20:51] 10Operations, 10Release-Engineering-Team (Kanban), 10User-MModell: Create keyholder gerrit repo - https://phabricator.wikimedia.org/T203108 (10mmodell) a:05mmodell>03faidon [16:20:56] 10Operations, 10ops-eqiad, 10DC-Ops: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10Dzahn) 05Resolved>03Open [16:21:10] (03CR) 1020after4: [C: 031] Add setuptools, LICENSE, README.rst etc. [software/keyholder] - 10https://gerrit.wikimedia.org/r/458224 (owner: 10Faidon Liambotis) [16:21:51] (03PS2) 10Jcrespo: mariadb: Decrease db2065 and db2061 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460050 [16:21:54] twentyafterfour: "I'd resolve this task, but I'm not sure what else needs to be done with regards to GitHub mirroring, CI etc.?" 
is why I left that open [16:21:56] marostegui: I have amemded [16:22:06] lets see [16:22:17] (03CR) 10Marostegui: [C: 031] mariadb: Decrease db2065 and db2061 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460050 (owner: 10Jcrespo) [16:22:32] (03PS5) 10Dduvall: ci: Give Docker more space on large-disk instances [puppet] - 10https://gerrit.wikimedia.org/r/459875 (https://phabricator.wikimedia.org/T203841) [16:23:15] (03CR) 10Jcrespo: [C: 032] mariadb: Decrease db2065 and db2061 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460050 (owner: 10Jcrespo) [16:23:36] paravoid: looks like you did a lot of good work which I'm going over with thcipriani. We can set up CI but we noticed neither of us has +2 on the repo [16:23:37] (03PS5) 10Ayounsi: Icinga: add check_bfd check (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/370103 [16:25:22] (03Merged) 10jenkins-bot: mariadb: Decrease db2065 and db2061 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460050 (owner: 10Jcrespo) [16:25:24] (03CR) 10jerkins-bot: [V: 04-1] Icinga: add check_bfd check (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/370103 (owner: 10Ayounsi) [16:25:57] 10Operations, 10ops-eqiad, 10DC-Ops: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10akosiaris) https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=wtp1043&service=MD+RAID still complains btw. 
[16:27:00] paravoid: mirror repo created on github [16:27:37] (03CR) 10jenkins-bot: db-codfw.php: Decrease weight for db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460046 (owner: 10Marostegui) [16:27:39] (03CR) 10jenkins-bot: mariadb: Decrease db2065 and db2061 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460050 (owner: 10Jcrespo) [16:28:25] twentyafterfour: oh sorry, it just inherited rights from operations/software [16:28:28] not intentiona [16:28:29] l [16:28:37] feel free to give +2 rights to yourselves obviously :) [16:28:52] or lmk how I can, if there's a group or something [16:28:56] in a meeting atm [16:29:03] (03PS5) 10Ayounsi: Icinga: add check_vcp (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/458850 (https://phabricator.wikimedia.org/T201097) [16:29:11] paravoid: no prob, thanks will take care of it [16:32:19] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Reduce db2065, db2061 weights (duration: 00m 48s) [16:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:56] (03CR) 10Alexandros Kosiaris: [C: 031] Cirrus default cluster should be "local" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460028 (owner: 10EBernhardson) [16:33:13] (03CR) 10Ayounsi: "Not sure what is the cleanest way to solve that error:" [puppet] - 10https://gerrit.wikimedia.org/r/370103 (owner: 10Ayounsi) [16:33:44] (03PS2) 10Papaul: DNS: Add mgmt DNS for frauth2001 and remove old asset tag entries [dns] - 10https://gerrit.wikimedia.org/r/460042 (https://phabricator.wikimedia.org/T204079) [16:34:24] (03PS1) 10Ottomata: Vary camus eventbus topic check based on active MW DC [puppet] - 10https://gerrit.wikimedia.org/r/460054 [16:34:31] (03CR) 10EBernhardson: [C: 032] Cirrus default cluster should be "local" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460028 (owner: 10EBernhardson) [16:35:53] (03Merged) 10jenkins-bot: Cirrus default cluster should be "local" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/460028 (owner: 10EBernhardson) [16:39:32] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: cirrussearch: eqiad -> local (codfw) (duration: 00m 50s) [16:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:39] (03CR) 10jenkins-bot: Cirrus default cluster should be "local" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460028 (owner: 10EBernhardson) [16:43:34] (03CR) 10Volans: [C: 031] "Nice work! Please double check the comment inline, otherwise LGTM." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [16:45:06] objections to me depooling wtp2020? this is to troubleshoot T194176 [16:45:07] T194176: wtp2020 correctable memory errors - https://phabricator.wikimedia.org/T194176 [16:45:15] akosiaris perhaps? [16:46:09] (03CR) 10Dzahn: Elasticsearch shard size check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [16:46:17] (03PS1) 10BBlack: cp1099: add authdns::testns and ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/460057 [16:46:19] godog: no, feel free [16:46:37] sweet, thanks [16:46:41] (03CR) 10Gehel: [C: 031] Elasticsearch shard size check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [16:46:47] the cluster is below 25% cpu usage , should be fine [16:46:53] (03CR) 10jerkins-bot: [V: 04-1] cp1099: add authdns::testns and ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/460057 (owner: 10BBlack) [16:47:43] (03CR) 10Dzahn: Elasticsearch shard size check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [16:48:16] (03PS1) 10Marostegui: db-codfw.php: Decrease db2065 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460058 
[16:48:33] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=wtp2020.codfw.wmnet [16:48:35] (03Abandoned) 10Marostegui: db-codfw.php: Repool db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460024 (owner: 10Marostegui) [16:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:45] (03CR) 10BBlack: [V: 032 C: 032] "Jenkins doesn't like the "include role", but we're just copying cp1008 in a transition here, can be fixed later!" [puppet] - 10https://gerrit.wikimedia.org/r/460057 (owner: 10BBlack) [16:49:38] (03CR) 10Marostegui: [C: 032] db-codfw.php: Decrease db2065 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460058 (owner: 10Marostegui) [16:49:48] 10Operations, 10Goal, 10Patch-For-Review: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 (10akosiaris) [16:49:50] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10akosiaris) [16:49:52] 10Operations, 10Goal, 10Patch-For-Review: Successfully switch backend traffic (MediaWiki, Swift, RESTBase, Parsoid and services) to be served from codfw - https://phabricator.wikimedia.org/T203776 (10akosiaris) 05Open>03Resolved a:03akosiaris The switchover has happened successfully, I am gonna happily... [16:50:34] (03PS1) 10Ottomata: Remove ReadingDepth from EventLogging MySQL whitelist [puppet] - 10https://gerrit.wikimedia.org/r/460059 (https://phabricator.wikimedia.org/T203596) [16:50:51] (03Merged) 10jenkins-bot: db-codfw.php: Decrease db2065 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460058 (owner: 10Marostegui) [16:50:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nope, not yet. 
wtp1043 is still complaining in the icinga UI about the RAID" [cookbooks] - 10https://gerrit.wikimedia.org/r/460049 (owner: 10Dzahn) [16:51:11] (03CR) 10Dduvall: [C: 031] "Tested on a new xlarge, no puppet errors:" [puppet] - 10https://gerrit.wikimedia.org/r/459875 (https://phabricator.wikimedia.org/T203841) (owner: 10Dduvall) [16:51:33] (03CR) 10Ottomata: [C: 032] Remove ReadingDepth from EventLogging MySQL whitelist [puppet] - 10https://gerrit.wikimedia.org/r/460059 (https://phabricator.wikimedia.org/T203596) (owner: 10Ottomata) [16:51:35] 10Operations, 10ops-codfw: wtp2020 correctable memory errors - https://phabricator.wikimedia.org/T194176 (10fgiunchedi) @papaul I've powered off wtp2020 so you can debug this further, feel free to upgrade/etc as needed and/or reseat/swap the memory [16:51:41] (03PS2) 10Ottomata: Remove ReadingDepth from EventLogging MySQL whitelist [puppet] - 10https://gerrit.wikimedia.org/r/460059 (https://phabricator.wikimedia.org/T203596) [16:51:43] (03CR) 10Ottomata: [V: 032 C: 032] Remove ReadingDepth from EventLogging MySQL whitelist [puppet] - 10https://gerrit.wikimedia.org/r/460059 (https://phabricator.wikimedia.org/T203596) (owner: 10Ottomata) [16:52:00] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Decrease weight for db2065 (duration: 00m 49s) [16:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:28] (03CR) 10Dzahn: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/12432/einsteinium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [16:52:42] 10Operations, 10hardware-requests: Request for swift ms-be refresh - https://phabricator.wikimedia.org/T201938 (10RobH) 05Open>03Resolved a:03RobH Please note that I'm resolving this task, as the 3 systems each for codfw and eqiad have been rolled into the #procurement tasks T201937 & T204133. 
[16:53:32] !log restarting eventlogging processors to pick up change to mysql whitelist [16:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:33] (03PS5) 10Dzahn: monitoring: enable using notes_url with grafana_alert [puppet] - 10https://gerrit.wikimedia.org/r/459862 (https://phabricator.wikimedia.org/T197873) [16:55:41] (03CR) 10Volans: [C: 031] "Confrimed, all good." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [16:56:14] (03CR) 10Dzahn: "this allows us to add an additional info URL with Grafana-based alerts" [puppet] - 10https://gerrit.wikimedia.org/r/459862 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [16:56:34] (03CR) 10Dzahn: [C: 032] monitoring: enable using notes_url with grafana_alert [puppet] - 10https://gerrit.wikimedia.org/r/459862 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [16:57:00] (03CR) 10jenkins-bot: db-codfw.php: Decrease db2065 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460058 (owner: 10Marostegui) [16:58:05] 10Operations, 10Discovery-Search: Warn when CirrusSearch is not configured to use local DCfor an extended time - https://phabricator.wikimedia.org/T204135 (10Gehel) [16:58:22] (03PS3) 10Dzahn: icinga::performance: remind users to ignore checks using notes_url [puppet] - 10https://gerrit.wikimedia.org/r/459864 (https://phabricator.wikimedia.org/T203485) [16:58:36] PROBLEM - DPKG on cp1099 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:58:45] PROBLEM - Check systemd state on cp1099 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:58:49] (03CR) 10Thcipriani: [V: 032 C: 032] "neat trick :)" [software/keyholder] - 10https://gerrit.wikimedia.org/r/458223 (owner: 10Faidon Liambotis) [16:59:13] cp1099 - looks like -traffic is working on it [16:59:41] cp1099 is in role test, I don't know why it would alert at all :P [16:59:55] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 50.76 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:00:22] anyways, I'll downtime it forever [17:00:37] bblack: see #-traffic, maybe set profile::base::notifications_enabled to 0? [17:00:59] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10awight) >>! In T202596#4577106, @Halfak wrote: > With the wikitext slot, we won't know which note relates to which jud... [17:01:05] the basic checks from base module are even added when using test or spare role [17:01:13] but maybe they shouldnt [17:01:52] (03CR) 10Smalyshev: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/459831 (owner: 10Smalyshev) [17:02:19] 10Operations, 10Discovery-Search, 10Datacenter-Switchover-2018: Warn when CirrusSearch is not configured to use local DCfor an extended time - https://phabricator.wikimedia.org/T204135 (10Gehel) [17:02:43] ah, there we go. 
role(spare::system) has profile::base::notifications_enabled: '0' but role(test) does not [17:02:51] let's add that to the test role i suppose [17:04:46] (03PS1) 10Dzahn: icinga: disable notifications for hosts using role(test) [puppet] - 10https://gerrit.wikimedia.org/r/460064 [17:04:56] (03PS8) 10Bstorm: quarry::database: Use mariadb instead of mysql module [puppet] - 10https://gerrit.wikimedia.org/r/454481 (https://phabricator.wikimedia.org/T181205) (owner: 10Zhuyifei1999) [17:06:06] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 57.32 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:06:35] (03CR) 10Ottomata: [C: 032] Vary camus eventbus topic check based on active MW DC [puppet] - 10https://gerrit.wikimedia.org/r/460054 (owner: 10Ottomata) [17:06:42] (03PS2) 10Ottomata: Vary camus eventbus topic check based on active MW DC [puppet] - 10https://gerrit.wikimedia.org/r/460054 [17:06:45] (03CR) 10Ottomata: [V: 032 C: 032] Vary camus eventbus topic check based on active MW DC [puppet] - 10https://gerrit.wikimedia.org/r/460054 (owner: 10Ottomata) [17:07:02] (03CR) 10Bstorm: [C: 032] quarry::database: Use mariadb instead of mysql module [puppet] - 10https://gerrit.wikimedia.org/r/454481 (https://phabricator.wikimedia.org/T181205) (owner: 10Zhuyifei1999) [17:07:25] (03PS9) 10Bstorm: quarry::database: Use mariadb instead of mysql module [puppet] - 10https://gerrit.wikimedia.org/r/454481 (https://phabricator.wikimedia.org/T181205) (owner: 10Zhuyifei1999) [17:07:57] (03PS4) 10Dzahn: icinga::performance: remind users to ignore checks using notes_url [puppet] - 10https://gerrit.wikimedia.org/r/459864 (https://phabricator.wikimedia.org/T203485) [17:08:34] (03CR) 10Thcipriani: [V: 032 C: 032] Add setuptools, LICENSE, README.rst etc. 
[software/keyholder] - 10https://gerrit.wikimedia.org/r/458224 (owner: 10Faidon Liambotis) [17:09:17] (03CR) 10Alexandros Kosiaris: [C: 031] "+1 but let's stall this for the next couple of days and merge on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/460021 (https://phabricator.wikimedia.org/T191921) (owner: 10Thcipriani) [17:09:42] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Switching to -1 to depict my previous comment" [puppet] - 10https://gerrit.wikimedia.org/r/460021 (https://phabricator.wikimedia.org/T191921) (owner: 10Thcipriani) [17:10:04] (03CR) 10Dzahn: [C: 032] icinga::performance: remind users to ignore checks using notes_url [puppet] - 10https://gerrit.wikimedia.org/r/459864 (https://phabricator.wikimedia.org/T203485) (owner: 10Dzahn) [17:10:12] (03PS5) 10Dzahn: icinga::performance: remind users to ignore checks using notes_url [puppet] - 10https://gerrit.wikimedia.org/r/459864 (https://phabricator.wikimedia.org/T203485) [17:11:06] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 79.33 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:15:16] (03CR) 10Thcipriani: [V: 032 C: 032] Add pytest support for unit/integration testing [software/keyholder] - 10https://gerrit.wikimedia.org/r/458225 (owner: 10Faidon Liambotis) [17:15:54] (03PS10) 10Bstorm: quarry: Move the install into a venv and upgrade to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) (owner: 10Zhuyifei1999) [17:18:04] (03CR) 10Bstorm: [C: 032] quarry: Move the install into a venv and upgrade to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) (owner: 10Zhuyifei1999) [17:18:22] (03CR) 10BryanDavis: quarry: Move the install into a venv and upgrade to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) (owner: 
10Zhuyifei1999) [17:18:24] (03CR) 10Bstorm: [C: 032] "merging after much discussion" [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) (owner: 10Zhuyifei1999) [17:24:44] (03PS1) 10Thcipriani: Add tox.ini [software/keyholder] - 10https://gerrit.wikimedia.org/r/460065 [17:25:53] (03CR) 10MSantos: "There are some entries in other files, but I am not sure if they should be taken out in the current step, the files are:" [puppet] - 10https://gerrit.wikimedia.org/r/460006 (https://phabricator.wikimedia.org/T202898) (owner: 10Gehel) [17:27:42] (03CR) 10Gehel: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/460006 (https://phabricator.wikimedia.org/T202898) (owner: 10Gehel) [17:30:47] passive checks have gone awol on einsteinium [17:31:17] but no defunct processes... so maybe transient [17:31:35] mmmh [17:31:38] might need a restart [17:32:02] T196336 [17:32:03] T196336: Icinga passive checks go awal and downtime stops working - https://phabricator.wikimedia.org/T196336 [17:32:14] yeah I trying to figure out if nay or yay [17:32:21] 10Operations, 10ops-codfw: wtp2020 correctable memory errors - https://phabricator.wikimedia.org/T194176 (10Papaul) a:05Papaul>03fgiunchedi - Re-seat memory - Upgrade BIOS from version 2.3 to 2.6 - Upgrade IDRAC from version 1.4 to 2.60 Server is back up [17:32:22] well I can do it anyway [17:32:31] you can try to downtime something [17:32:35] and if it doesn't work [17:32:40] you know it's that [17:32:41] :D [17:32:45] nsca ? [17:32:53] what does nsca have to do with downtiming ? [17:32:54] no it's the command file [17:33:05] ah nsca get stuck because of the command file ? [17:33:09] yep [17:33:14] icinga stop processing the command file [17:33:15] somehow I am losing interest already [17:33:20] I never hat time to dig into it [17:33:22] damn icinga [17:33:27] it's 2018 and we still have it [17:34:03] yeah [17:34:11] (03CR) 10Muehlenhoff: [C: 031] "Looks good, I'll merge tomorrow." 
[puppet] - 10https://gerrit.wikimedia.org/r/459881 (owner: 10Legoktm) [17:34:28] it's all your fault akosiaris, you didn't propose a better alternative when it was, 2y ago? :-P [17:34:34] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/12433/" [puppet] - 10https://gerrit.wikimedia.org/r/459660 (owner: 10Dzahn) [17:34:40] 3 maybe [17:34:46] I even tried the migration [17:34:50] never had the time to finish it [17:35:06] turns out the project is now in a quagmire [17:35:13] so probably it was for the best [17:35:23] lol [17:35:28] do you want me to restart it? [17:35:52] I am looking into it a bit [17:36:02] let's see how much this can keep me interested [17:36:07] ack, I'll leave it to you then [17:36:23] I arrived to the point I needed to start chasing down the source code and had no time [17:36:32] and never had time again since :D [17:36:47] but maybe some smart perf/strace will give you the culprit [17:37:36] yeah it's not scheduling downtime either [17:37:48] so .. restart icinga ? 
[17:37:57] yep that fixes it :) [17:38:22] log with the task, so we keep track of the frequency [17:38:29] T196336 [17:38:30] T196336: Icinga passive checks go awal and downtime stops working - https://phabricator.wikimedia.org/T196336 [17:38:37] (03CR) 10Thcipriani: "recheck" [software/keyholder] - 10https://gerrit.wikimedia.org/r/460065 (owner: 10Thcipriani) [17:38:55] !log restart icinga T196336 [17:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:06] happening a bit too often recently, I'll probably need to find the time to look at it sooner or later [17:40:43] 10Operations, 10Release-Engineering-Team (Kanban), 10User-MModell: Create keyholder gerrit repo - https://phabricator.wikimedia.org/T203108 (10mmodell) It's now mirroring to https://github.com/wikimedia/operations-software-keyholder/ [17:42:34] 10Operations, 10Release-Engineering-Team: Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003 (10thcipriani) [17:42:41] 10Operations, 10Release-Engineering-Team (Kanban), 10User-MModell: Create keyholder gerrit repo - https://phabricator.wikimedia.org/T203108 (10thcipriani) 05Open>03Resolved >>! In T203108#4560449, @faidon wrote: > I'd resolve this task, but I'm not sure what else needs to be done with regards to GitHub m... 
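The downtime test suggested at 17:32:31 goes through the same external command file that the passive checks depend on, which is why a downtime that never gets applied points at this problem. A minimal sketch of what such a command looks like, assuming the Nagios/Icinga 1.x external command syntax; the host name, author, comment and the command file path are hypothetical and not taken from the log:

```python
import time


def schedule_host_downtime(host, duration_s, author, comment, now=None):
    """Format a SCHEDULE_HOST_DOWNTIME external command line.

    Field order after the command name:
    host;start_time;end_time;fixed;trigger_id;duration;author;comment
    """
    now = int(time.time()) if now is None else int(now)
    end = now + duration_s
    # fixed=1 (fixed downtime window), trigger_id=0 (not triggered by another downtime)
    return (
        f"[{now}] SCHEDULE_HOST_DOWNTIME;{host};{now};{end};"
        f"1;0;{duration_s};{author};{comment}\n"
    )


def submit(command_line, command_file="/var/lib/icinga/rw/icinga.cmd"):
    """Write one command to the command file (the default path is an assumption)."""
    with open(command_file, "w") as pipe:
        pipe.write(command_line)


if __name__ == "__main__":
    # Only prints the line; call submit() on the monitoring host to actually send it.
    print(schedule_host_downtime("example1001", 3600, "someone", "testing T196336"), end="")
```

If the daemon has stopped reading the command file, a line like this is written but the downtime never shows up, which matches the symptom described at 17:37:36.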
[17:44:08] (03PS19) 10Gehel: Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [17:44:48] (03PS3) 10Muehlenhoff: Print group memberships which granted Hadoop access to check for HDFS cleanups [puppet] - 10https://gerrit.wikimedia.org/r/459558 (https://phabricator.wikimedia.org/T200312) [17:45:14] (03CR) 10Muehlenhoff: Print group memberships which granted Hadoop access to check for HDFS cleanups (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/459558 (https://phabricator.wikimedia.org/T200312) (owner: 10Muehlenhoff) [17:45:33] (03CR) 10Gehel: [C: 032] Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [17:46:08] onimisionipe: \o/ good work! ^^^ [17:47:12] onimisionipe: you're going to generate your first alert going red in a few minutes! [17:47:32] we already have too large shards? [17:47:33] :D [17:47:43] a check that's red from the beginning? :) [17:47:47] volans: why do you think we need a check ? [17:47:54] :-P [17:48:38] to make things clear: onimisionipe did a great job in writing that check, the fact that the check is red has nothing to do with his work! [17:49:05] yes [17:50:45] gehel volans: thanks for all the reviews...without them, I will not get anywhere.. 
[17:50:47] !log otto@deploy1001 Started deploy [analytics/refinery@407da92]: Deploying refinery-source 0.0.74 jars for T203804 [17:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:54] T203804: Refactor Refine job scalaopt to use property files and CLI overrides - https://phabricator.wikimedia.org/T203804 [17:50:56] (03PS11) 10Ottomata: Use properties file for RefineMonitor [puppet] - 10https://gerrit.wikimedia.org/r/460023 (https://phabricator.wikimedia.org/T203804) [17:51:19] 10Operations, 10MediaWiki-ResourceLoader, 10Performance-Team, 10Traffic: Investigate source of 404 Not Found responses from load.php - https://phabricator.wikimedia.org/T202479 (10Krinkle) [17:51:25] (03CR) 10Ottomata: [V: 032 C: 032] Use properties file for RefineMonitor [puppet] - 10https://gerrit.wikimedia.org/r/460023 (https://phabricator.wikimedia.org/T203804) (owner: 10Ottomata) [17:52:02] onimisionipe: thank you for the patience ;) [17:52:12] PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /srv 51296 MB (10% inode=99%) [17:52:28] !log otto@deploy1001 Finished deploy [analytics/refinery@407da92]: Deploying refinery-source 0.0.74 jars for T203804 (duration: 01m 41s) [17:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:04] !log otto@deploy1001 Started deploy [analytics/refinery@407da92]: Deploying refinery-source 0.0.74 jars for T203804 [17:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:12] RECOVERY - Disk space on elastic1024 is OK: DISK OK [17:53:23] (03PS1) 10Rush: phabricator: add risk rating to advanced creation form [puppet] - 10https://gerrit.wikimedia.org/r/460069 (https://phabricator.wikimedia.org/T204138) [17:55:24] (03PS3) 10Dzahn: DNS: Add mgmt DNS for frauth2001 and remove old asset tag entries [dns] - 10https://gerrit.wikimedia.org/r/460042 (https://phabricator.wikimedia.org/T204079) (owner: 10Papaul) [17:56:02] RECOVERY - 
Check systemd state on cp1099 is OK: OK - running: The system is fully operational [17:59:15] !log otto@deploy1001 Finished deploy [analytics/refinery@407da92]: Deploying refinery-source 0.0.74 jars for T203804 (duration: 06m 11s) [17:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:22] T203804: Refactor Refine job scalaopt to use property files and CLI overrides - https://phabricator.wikimedia.org/T203804 [17:59:33] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:59:33] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:59:53] (03CR) 10Dzahn: [C: 032] DNS: Add mgmt DNS for frauth2001 and remove old asset tag entries [dns] - 10https://gerrit.wikimedia.org/r/460042 (https://phabricator.wikimedia.org/T204079) (owner: 10Papaul) [18:00:57] 10Operations, 10Maps-Sprint, 10Maps (Tilerator), 10Reading-Infrastructure-Team-Backlog (Kanban): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (10MSantos) To be more specific, when the `populate_admin()` script runs, tilerator throws the following logs: ``` message:... [18:01:13] (03CR) 1020after4: [C: 031] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/460069 (https://phabricator.wikimedia.org/T204138) (owner: 10Rush) [18:02:03] (03PS1) 10Ottomata: Update refinery-job jar used for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/460070 (https://phabricator.wikimedia.org/T203804) [18:02:42] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:02:43] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:02:51] (03PS2) 10Ottomata: Update refinery-job jar used for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/460070 (https://phabricator.wikimedia.org/T203804) [18:03:02] (03CR) 10Ottomata: [V: 032 C: 032] Update refinery-job jar used for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/460070 (https://phabricator.wikimedia.org/T203804) (owner: 10Ottomata) [18:03:34] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) >>! In T200297#4576584, @daniel wrote: > Note that it's blocked on {T204112}. That's not particul... [18:04:44] (03PS2) 10Rush: phabricator: add risk rating to advanced creation form [puppet] - 10https://gerrit.wikimedia.org/r/460069 (https://phabricator.wikimedia.org/T204138) [18:10:28] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: move/setup/install frauth2001.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (10Papaul) [18:11:11] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) Another change which I'll document here is that I'm dropping the use cases for "write-only" workf... 
[18:15:06] (03PS1) 10Ottomata: Fix properties file for refine job [puppet] - 10https://gerrit.wikimedia.org/r/460071 (https://phabricator.wikimedia.org/T203804) [18:15:46] (03CR) 10Ottomata: [V: 032 C: 032] Fix properties file for refine job [puppet] - 10https://gerrit.wikimedia.org/r/460071 (https://phabricator.wikimedia.org/T203804) (owner: 10Ottomata) [18:19:12] (03CR) 10Rush: [C: 032] phabricator: add risk rating to advanced creation form [puppet] - 10https://gerrit.wikimedia.org/r/460069 (https://phabricator.wikimedia.org/T204138) (owner: 10Rush) [18:19:20] (03PS3) 10Rush: phabricator: add risk rating to advanced creation form [puppet] - 10https://gerrit.wikimedia.org/r/460069 (https://phabricator.wikimedia.org/T204138) [18:24:09] !log reindexing Esperanto wikis on elastic@codfw and elastic@eqiad (T203005) [18:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:17] T203005: Re-index Esperanto Wikis - https://phabricator.wikimedia.org/T203005 [18:42:57] (03PS2) 10Gehel: Allow kafka updater to have options [puppet] - 10https://gerrit.wikimedia.org/r/459831 (owner: 10Smalyshev) [18:42:59] 10Operations, 10Discovery-Search, 10Datacenter-Switchover-2018: Warn when CirrusSearch is not configured to use local DCfor an extended time - https://phabricator.wikimedia.org/T204135 (10EBernhardson) We almost have an [[ https://test.wikipedia.org/wiki/Special:ApiSandbox#action=cirrus-config-dump&format=js... [18:44:20] (03CR) 10Gehel: [C: 032] Allow kafka updater to have options [puppet] - 10https://gerrit.wikimedia.org/r/459831 (owner: 10Smalyshev) [18:48:22] PROBLEM - puppet last run on wdqs2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:06] oops, that puppet failure is probably me, checking [18:49:43] PROBLEM - puppet last run on wdqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:50:03] yep, definitely me, patch coming up [18:50:15] 10Operations, 10Cloud-Services, 10Parsing-Team, 10Datacenter-Switchover-2018, and 2 others: VisualEditor broken on wikitech when codfw is primary: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500." - https://phabricator.wikimedia.org/T163438 (10akosiaris) VE is functional on... [18:50:43] PROBLEM - puppet last run on wdqs2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:51:17] (03PS1) 10Jcrespo: mariadb: Decrease even more db2065 load, setup more api nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460073 [18:52:33] (03PS1) 10Gehel: wdqs: add default value for kafka_updater_options [puppet] - 10https://gerrit.wikimedia.org/r/460074 [18:54:23] PROBLEM - puppet last run on wdqs1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:54:52] (03CR) 10Gehel: [C: 032] wdqs: add default value for kafka_updater_options [puppet] - 10https://gerrit.wikimedia.org/r/460074 (owner: 10Gehel) [18:57:23] PROBLEM - puppet last run on wdqs2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:58:13] PROBLEM - puppet last run on wdqs2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:58:33] RECOVERY - puppet last run on wdqs2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:00:23] PROBLEM - puppet last run on wdqs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:00:58] 10Operations: SRE quarterly goal: allow MediaWiki requests to be served by PHP7 alongside HHVM - https://phabricator.wikimedia.org/T203959 (10Legoktm) >>! 
In T203959#4571089, @Joe wrote: > We're probably not going to get to the stretch goals, but it should be noted that MediaWiki is still not ready to run on PHP... [19:04:11] (03CR) 10Alexandros Kosiaris: [C: 031] mariadb: Decrease even more db2065 load, setup more api nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460073 (owner: 10Jcrespo) [19:09:10] (03CR) 10Jcrespo: [C: 032] mariadb: Decrease even more db2065 load, setup more api nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460073 (owner: 10Jcrespo) [19:09:21] (03CR) 10Jcrespo: [C: 032] "Thanks Alex!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460073 (owner: 10Jcrespo) [19:10:34] (03Merged) 10jenkins-bot: mariadb: Decrease even more db2065 load, setup more api nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460073 (owner: 10Jcrespo) [19:13:47] (03CR) 10jenkins-bot: mariadb: Decrease even more db2065 load, setup more api nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460073 (owner: 10Jcrespo) [19:13:49] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Decrease db2065 load (duration: 00m 49s) [19:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:12] RECOVERY - puppet last run on wdqs1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:20:43] RECOVERY - puppet last run on wdqs1006 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [19:21:22] RECOVERY - puppet last run on wdqs2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:22:45] (03PS1) 10Jcrespo: mariadb: Reduce db2061 api load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460079 [19:22:52] RECOVERY - puppet last run on wdqs2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:23:33] RECOVERY - puppet last run on wdqs2006 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [19:25:02] 
RECOVERY - puppet last run on wdqs1008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:26:06] (03PS4) 10Dzahn: mediawiki::web::prod_sites: make includes explicit in more wikis [puppet] - 10https://gerrit.wikimedia.org/r/451257 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [19:27:01] (03CR) 10Dzahn: [C: 04-1] "PS4: added contents from ./files/apache/sites/api-rewrites.incl below the api-rewrites comment" [puppet] - 10https://gerrit.wikimedia.org/r/451257 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [19:27:36] (03PS3) 10Gehel: maps: decommission maps-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/460006 (https://phabricator.wikimedia.org/T202898) [19:29:16] (03PS2) 10Jcrespo: mariadb: Reduce db2061 api load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460079 [19:31:53] (03CR) 10Gehel: [C: 032] maps: decommission maps-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/460006 (https://phabricator.wikimedia.org/T202898) (owner: 10Gehel) [19:32:18] (03PS4) 10Gehel: maps: decommission maps-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/460006 (https://phabricator.wikimedia.org/T202898) [19:35:37] (03CR) 10Jcrespo: "Maybe not needed? Errors went to 0 after my previous deploy. We will need to keep tuning not only to avoid errors, but to balance the late" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460079 (owner: 10Jcrespo) [19:38:51] 10Operations, 10Maps, 10Maps-Sprint, 10decommission, 10hardware-requests: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (10Gehel) a:03RobH [19:39:02] wooo [19:39:06] death to more old servers [19:39:13] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 51108 MB (10% inode=99%) [19:45:15] (03CR) 10Marostegui: "And tomorrow we will probably also have at least one of the two hosts that are out back in the mix once it gets recloned. 
So we can rebala" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460079 (owner: 10Jcrespo) [19:45:25] (03CR) 10Jcrespo: "Finally deploying it, some small amount of errors still happening on 61." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460079 (owner: 10Jcrespo) [19:45:29] (03CR) 10Jcrespo: [C: 032] mariadb: Reduce db2061 api load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460079 (owner: 10Jcrespo) [19:46:52] (03Merged) 10jenkins-bot: mariadb: Reduce db2061 api load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460079 (owner: 10Jcrespo) [19:49:01] (03PS9) 10Bstorm: wiki replicas - prepare for refactored actor storage [puppet] - 10https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T195747) [19:49:06] (03PS1) 10Jcrespo: mariadb: Add db2056 to api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460086 [19:49:46] (03CR) 10Imarlier: profile::mediawiki::php: add support for php-fpm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/455154 (https://phabricator.wikimedia.org/T201140) (owner: 10Giuseppe Lavagetto) [19:50:42] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Decrease db2061 load (duration: 00m 50s) [19:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:19] (03CR) 10Bstorm: "I've now applied the reasoning from T174047 with consideration to the notion that forward compat changes should be in the main tables, not" [puppet] - 10https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T195747) (owner: 10Bstorm) [19:54:23] (03PS6) 10Dduvall: ci: Allow Docker nodes to use a dedicated /var/lib/docker volume [puppet] - 10https://gerrit.wikimedia.org/r/459875 (https://phabricator.wikimedia.org/T203841) [19:54:33] "PHP fatal error: [19:54:33] entire web request took longer than 60 seconds and timed out" [19:54:35] Hmm [19:56:53] RECOVERY - Disk space on elastic1025 is OK: DISK OK [19:57:28] (03CR) 10jenkins-bot: mariadb: Reduce db2061 
api load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460079 (owner: 10Jcrespo) [20:06:01] Bsadowski1: Yeah, we started enforcing that again last week. What were you trying to do? [20:06:21] (A mistake had meant that we'd not been enforcing it for a couple of years or so. Whoops.) [20:07:17] It finally managed to complete. :P [20:07:55] 10Operations, 10ops-eqiad, 10DC-Ops: Rename labvirt1019 and cloudvirt1020 to cloudvirt1019 and cloudvirt1020 - https://phabricator.wikimedia.org/T204004 (10Andrew) thanks chris! [20:09:49] 10Operations, 10ops-codfw, 10procurement: setup/install cumin2001.eqiad.wmnet - https://phabricator.wikimedia.org/T204156 (10RobH) p:05Triage>03Normal [20:14:53] <_joe_> James_F: actually, almost 4 :) [20:15:20] <_joe_> but it's truly 1 year that it was our fault :) [20:15:27] Oops. [20:25:27] 10Operations, 10Discovery-Search, 10Datacenter-Switchover-2018: Warn when CirrusSearch is not configured to use local DCfor an extended time - https://phabricator.wikimedia.org/T204135 (10Volans) I'm not too familiar with the order of inclusion of MW configs, but I was wondering if we could expose that via s... [20:30:40] (03CR) 10Jforrester: "Theoretically good to go after the wmf.22 train is live on testwiki…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454534 (https://phabricator.wikimedia.org/T198309) (owner: 10Daniel Kinzler) [20:40:23] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10Halfak) > The main thing driving us to that conclusion was that a "notes" field should be shared between damaging and... [21:18:02] 10Operations, 10Discovery-Search, 10Datacenter-Switchover-2018: Warn when CirrusSearch is not configured to use local DCfor an extended time - https://phabricator.wikimedia.org/T204135 (10Gehel) >>! 
In T204135#4578277, @Volans wrote: > I'm not too familiar with the order of inclusion of MW configs, but I was... [21:20:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsing-Team: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10Arlolra) [21:20:40] 10Operations, 10ops-codfw, 10Parsing-Team: wtp2020 correctable memory errors - https://phabricator.wikimedia.org/T194176 (10Arlolra) [21:22:42] (03CR) 10Dduvall: [C: 031] "After changing the allocation of docker/workspaces to 70/30% (from 85/15%) this now works for bigram instances as well. We have two each o" [puppet] - 10https://gerrit.wikimedia.org/r/459875 (https://phabricator.wikimedia.org/T203841) (owner: 10Dduvall) [21:44:07] (03CR) 10ArielGlenn: "What uses dba_* functions? I hunted in core but couldn't find anything." [puppet] - 10https://gerrit.wikimedia.org/r/459882 (owner: 10Legoktm) [21:46:35] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 51332 MB (10% inode=99%) [21:51:31] (03PS1) 10Volans: mediawiki: improve siteinfo checks [software/spicerack] - 10https://gerrit.wikimedia.org/r/460114 (https://phabricator.wikimedia.org/T199079) [22:04:24] (03CR) 10Legoktm: "It's a separate library now, see (gets deployed via mediawiki/vendor)." [puppet] - 10https://gerrit.wikimedia.org/r/459882 (owner: 10Legoktm) [22:05:03] 10Operations, 10ops-codfw: setup/install cumin2001.eqiad.wmnet - https://phabricator.wikimedia.org/T204156 (10RobH) [22:16:52] 10Operations, 10Discovery-Search, 10Datacenter-Switchover-2018: Warn when CirrusSearch is not configured to use local DCfor an extended time - https://phabricator.wikimedia.org/T204135 (10Volans) >>! In T204135#4578507, @Gehel wrote: > I don't know much about MW configs myself, and I'm not really sure what y... 
[22:17:24] RECOVERY - Disk space on elastic1025 is OK: DISK OK [22:24:46] 10Operations, 10ops-eqiad, 10netops: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10ayounsi) p:05Triage>03Normal [22:29:43] 10Operations, 10DBA, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Paladox) with the migration to notedb accounts and changes have been removed from the... [22:30:34] 10Operations, 10ops-codfw: setup/install cumin2001.eqiad.wmnet - https://phabricator.wikimedia.org/T204156 (10RobH) [22:30:51] (03PS1) 10RobH: setting cumin2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/460124 (https://phabricator.wikimedia.org/T204156) [22:32:29] (03CR) 10RobH: [C: 032] setting cumin2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/460124 (https://phabricator.wikimedia.org/T204156) (owner: 10RobH) [22:33:01] (03CR) 10jerkins-bot: [V: 04-1] setting cumin2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/460124 (https://phabricator.wikimedia.org/T204156) (owner: 10RobH) [22:34:39] hrmm [22:34:40] whyyy [22:34:44] everthing in that file looks good. [22:34:57] 22:32:59 # error: rfc1035: Zone 10.in-addr.arpa.: Zonefile parse error at line 3495: General parse error [22:35:13] well, i added to 3491, not 3495 [22:35:17] and it looks right to me... 
[22:35:21] ohhh, wiat [22:35:25] i see my typo, goddamn it [22:35:44] (03PS2) 10RobH: setting cumin2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/460124 (https://phabricator.wikimedia.org/T204156) [22:36:05] (03PS1) 10Papaul: DNS: Add production dns entries for frauth2001 [dns] - 10https://gerrit.wikimedia.org/r/460127 [22:36:32] (03PS3) 10RobH: setting cumin2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/460124 (https://phabricator.wikimedia.org/T204156) [22:37:00] (03CR) 10RobH: [C: 032] setting cumin2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/460124 (https://phabricator.wikimedia.org/T204156) (owner: 10RobH) [22:38:41] damn zuul is busy. [22:41:31] !log reindexing Esperanto wikis on elastic@codfw and elastic@eqiad complete (T203005) [22:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:40] T203005: Re-index Esperanto Wikis - https://phabricator.wikimedia.org/T203005 [22:42:21] 10Operations, 10Readers-Web-Backlog (Tracking), 10Services (watching): Create Debian packages for Node.js 10 upgrade - https://phabricator.wikimedia.org/T203239 (10Jdlrobson) [22:43:35] PROBLEM - Restbase root url on restbase2003 is CRITICAL: HTTP CRITICAL - No data received from host [22:44:45] RECOVERY - Restbase root url on restbase2003 is OK: HTTP OK: HTTP/1.1 200 - 16052 bytes in 0.138 second response time [22:45:00] .... [22:45:03] damn still waiting on zuul [22:45:07] (03PS1) 10Andrew Bogott: Horizon: set a default network ID for neutron VMs. 
[puppet] - 10https://gerrit.wikimedia.org/r/460144 [22:48:57] (03PS1) 10Smalyshev: Temporarily enable debug logging for regex matches [puppet] - 10https://gerrit.wikimedia.org/r/460151 [22:50:13] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: move/setup/install frauth2001.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (10Papaul) ``` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range disabled] - member ge-0/0/16; - member ge-1/0/16; [edit in... [22:51:34] (03CR) 10Dzahn: [C: 04-1] "with the /29 netmask in 10.195.0.72/29 the range for hosts is only 192.195.0.73 - 192.195.0.78 and .79 is the broadcast address" [dns] - 10https://gerrit.wikimedia.org/r/460127 (owner: 10Papaul) [22:52:59] (03CR) 10Dzahn: "sorry, host range 10.195.0.73 - 10.195.0.78, .79 broadcast" [dns] - 10https://gerrit.wikimedia.org/r/460127 (owner: 10Papaul) [22:54:00] (03CR) 10Dzahn: "looks like we can use the .73 instead though but last host for this subnet" [dns] - 10https://gerrit.wikimedia.org/r/460127 (owner: 10Papaul) [22:54:08] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: move/setup/install frauth2001.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (10Papaul) ``` papaul@fasw-c-codfw> show interfaces ge-0/0/16 Physical interface: ge-0/0/16, Enabled, Physical link is Up papaul@fasw-c-codfw> show... 
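The /29 arithmetic from the review comments at 22:51:34 and 22:52:59 can be checked with Python's standard `ipaddress` module. This is only a sketch of the calculation; the helper name `usable_range` is made up for illustration:

```python
import ipaddress


def usable_range(cidr):
    """Return (first_host, last_host, broadcast, host_count) for an IPv4 network."""
    net = ipaddress.ip_network(cidr)
    hosts = list(net.hosts())  # excludes the network and broadcast addresses
    return (str(hosts[0]), str(hosts[-1]), str(net.broadcast_address), len(hosts))


if __name__ == "__main__":
    first, last, broadcast, count = usable_range("10.195.0.72/29")
    print(f"hosts {first} - {last}, broadcast {broadcast}, {count} usable addresses")
```

For 10.195.0.72/29 this gives hosts 10.195.0.73 through 10.195.0.78 with 10.195.0.79 as broadcast, the same range as in the corrected review comment.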
[22:54:32] (03PS1) 10RobH: set cumin2001 isntall params [puppet] - 10https://gerrit.wikimedia.org/r/460154 (https://phabricator.wikimedia.org/T204156) [22:54:42] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: move/setup/install frauth2001.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (10Papaul) [22:57:01] 10Operations, 10DBA, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Dzahn) You are saying we won't need any mysql/mariadb for Gerrit anymore? [22:58:13] 10Operations, 10DBA, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Paladox) Yep, but currently 2.x will still require a db just 2.15 does not read change... [22:58:57] (03CR) 10Dzahn: "Thank you for this merge Bstorm!:) yay" [puppet] - 10https://gerrit.wikimedia.org/r/454481 (https://phabricator.wikimedia.org/T181205) (owner: 10Zhuyifei1999) [22:59:18] (03PS1) 10RobH: setting cumin2001 ipv6 entry [dns] - 10https://gerrit.wikimedia.org/r/460159 (https://phabricator.wikimedia.org/T204156) [22:59:38] (03CR) 10RobH: [C: 032] set cumin2001 isntall params [puppet] - 10https://gerrit.wikimedia.org/r/460154 (https://phabricator.wikimedia.org/T204156) (owner: 10RobH) [23:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180912T2300). [23:00:05] Ebe123: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:15] (03CR) 10RobH: [C: 032] setting cumin2001 ipv6 entry [dns] - 10https://gerrit.wikimedia.org/r/460159 (https://phabricator.wikimedia.org/T204156) (owner: 10RobH) [23:00:34] * Ebe123 is ready! https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Score/+/458910 [23:00:49] 10Operations, 10Quarry, 10Patch-For-Review, 10cloud-services-team (Kanban): Let quarry use the mariadb module - https://phabricator.wikimedia.org/T181205 (10Dzahn) This might be resolved now since the merge above. Much appreciated, thank you @zhuyifei1999 and @Bstorm Did that already actively switch it? [23:01:54] 10Operations, 10ops-codfw: setup/install cumin2001.eqiad.wmnet - https://phabricator.wikimedia.org/T204156 (10RobH) [23:02:30] 10Operations, 10DBA, 10Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10Dzahn) quarry doesn't use the mysql module anymore since https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/454481/ was merged.... [23:07:02] 10Operations, 10ops-codfw: apply hostname label to cumin2001 / wmf6407 and update visible label field in racktables - https://phabricator.wikimedia.org/T204173 (10RobH) p:05Triage>03Normal [23:07:26] 10Operations, 10ops-codfw: setup/install cumin2001.eqiad.wmnet - https://phabricator.wikimedia.org/T204156 (10RobH) a:03RobH [23:17:18] 10Operations, 10Quarry, 10Patch-For-Review, 10cloud-services-team (Kanban): Let quarry use the mariadb module - https://phabricator.wikimedia.org/T181205 (10zhuyifei1999) The new instance quarry-db-01 is now on mariadb, but the old instance quarry-main-01 (with puppet disabled) is still on mysql. If everyt... [23:26:07] 10Operations, 10ops-codfw: setup/install cumin2001.eqiad.wmnet - https://phabricator.wikimedia.org/T204156 (10RobH) [23:26:49] Is there the SWAT tonight? 
[23:29:25] Operations: setup/install cumin2001.eqiad.wmnet - https://phabricator.wikimedia.org/T204156 (RobH) a:RobH>MoritzMuehlenhoff @MoritzMuehlenhoff, cumin2001 has now had role:spare applied and is awaiting service implementation. I'm assuming you are the person to do this, as you were involved in the re...
[23:29:29] jouncebot: now
[23:29:29] For the next 0 hour(s) and 30 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180912T2300)
[23:29:31] jouncebot: next
[23:29:32] In 0 hour(s) and 30 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180913T0000)
[23:29:39] Ebe123: Apparently
[23:30:09] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Score/+/458910 is scheduled...
[23:30:15] Yeah, hang on
[23:30:24] greg-g: Are we ok to deploy post dc migration?
[23:31:35] Oh, there's the big migration :) I can take another time if that's getting in the way
[23:31:58] AFAIK things are all ok after some issues
[23:32:01] Just want to double check
[23:33:05] PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /srv 49150 MB (10% inode=99%)
[23:33:41] Operations, Maps, Maps-Sprint, decommission: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (RobH)
[23:33:53] Operations, ops-codfw, fundraising-tech-ops, Patch-For-Review: move/setup/install frauth2001.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (Papaul) @Jgreen @Dzahn has a comment on https://gerrit.wikimedia.org/r/#/c/operations/dns/+/460127/ the network 10.195.0.72/29 can only hold 6 host...
[23:35:15] RECOVERY - Disk space on elastic1024 is OK: DISK OK
[23:36:22] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@7e5e537]: Deploy Blazegraph & Updater for T202765 and T203646 handling
[23:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:31] T203646: Wikidata Query Service nodes out of sync - https://phabricator.wikimedia.org/T203646
[23:40:58] Operations, Analytics, hardware-requests, User-Elukey: eqiad | (14 + 6) hadoop hardware refresh and expansion - https://phabricator.wikimedia.org/T199673 (RobH)
[23:42:05] Reedy, they were still on the schedule
[23:42:28] and AFAIK ops aren't firefighting anything at the moment
[23:45:04] (PS1) Ayounsi: Create shell account for kharlan and add to researchers [puppet] - https://gerrit.wikimedia.org/r/460175 (https://phabricator.wikimedia.org/T203847)
[23:49:11] (CR) Dzahn: [C: 1] Create shell account for kharlan and add to researchers [puppet] - https://gerrit.wikimedia.org/r/460175 (https://phabricator.wikimedia.org/T203847) (owner: Ayounsi)
[23:51:44] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to researchers for kharlan - https://phabricator.wikimedia.org/T203847 (ayounsi) a:ayounsi @kostajh could you please sign https://phabricator.wikimedia.org/L3 ?
[23:53:17] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to researchers for kharlan - https://phabricator.wikimedia.org/T203847 (ayounsi)
[23:58:10] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to researchers for kharlan - https://phabricator.wikimedia.org/T203847 (ayounsi)