[00:00:04] twentyafterfour: #bothumor I ♥ Unicode. All rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180628T0000).
[00:05:23] !log taking apache offline momentarily on phab1001
[00:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:47] PROBLEM - https://phabricator.wikimedia.org on phab1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2428 bytes in 0.010 second response time
[00:11:58] RECOVERY - https://phabricator.wikimedia.org on phab1002 is OK: HTTP OK: HTTP/1.1 200 OK - 32278 bytes in 0.270 second response time
[00:12:57] !log phabricator update failed. unable to apply database migrations: mysql access denied
[00:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:20] !log rolled back and restored service to previous state
[00:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:35] Uhm, wtf
[00:28:40] A full scap breaks on wmf.10
[00:28:53] 1) when lint fails, it should tell you why, not just say "exit status 123"
[00:29:00] 2) why on earth do we have syntax errors in vendor
[00:29:05] Fatal error: syntax error, unexpected T_CONST, expecting T_VARIABLE in /srv/mediawiki-staging/php-1.32.0-wmf.10/vendor/psy/psysh/test/ClassWithSecrets.php on line 16
[00:29:43] Oh wait I wasn't trying a full scap, rather sync-dir php-1.32.0-wmf.10
[00:31:54] OK then I will have to violate policy and sync includes/ and resources/ separately with two syncs
[00:32:49] !log catrope@deploy1001 Synchronized php-1.32.0-wmf.10/includes: Watchlist perf patches for SWAT, part 1 (T197168, T198140, T198142) (duration: 01m 13s)
[00:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:54] T197168: Fix slow Watchlist load and interaction times - https://phabricator.wikimedia.org/T197168
[00:32:54] T198140: Prevent updateInputSize() in mw.rcfilters.ui.FilterTagMultiselectWidget - https://phabricator.wikimedia.org/T198140
[00:32:54] T198142: Speed up lazy-building of menu - https://phabricator.wikimedia.org/T198142
[00:33:57] !log catrope@deploy1001 Synchronized php-1.32.0-wmf.10/resources: Watchlist perf patches for SWAT, part 2 (T197168, T198140, T198142) (duration: 00m 57s)
[00:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:29] (PS7) Smalyshev: Generate daily diffs for categories RDF [puppet] - https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T198356)
[00:58:00] (CR) jerkins-bot: [V: -1] Generate daily diffs for categories RDF [puppet] - https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T198356) (owner: Smalyshev)
[01:30:48] RoanKattouw: Be sure to file a task if there isn't one already.
[01:31:07] psysh was updated very recently, train block worthy imho
[01:34:55] (CR) Krinkle: [C: -1] "In a later patch is fine, and per Aaron, might even be obsolete if it ends up removed indeed. If the key name is the only difference then " [mediawiki-config] - https://gerrit.wikimedia.org/r/440469 (https://phabricator.wikimedia.org/T198239) (owner: Aaron Schulz)
[02:00:38] PROBLEM - Memory correctable errors -EDAC- on mw1239 is CRITICAL: 4 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad%2520prometheus%252Fops
[02:06:28] Operations, DNS, Traffic, Wikimedia-Apache-configuration: m.{project}.org portal/redirect consistency - https://phabricator.wikimedia.org/T78421#4320893 (MZMcBride)
[02:21:29] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.8) (duration: 07m 54s)
[02:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:53:09] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.10) (duration: 13m 47s)
[02:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:03:34] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Thu Jun 28 03:03:34 UTC 2018 (duration 10m 25s)
[03:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:06:22] (PS8) Smalyshev: Generate daily diffs for categories RDF [puppet] - https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T198356)
[03:13:48] (CR) Smalyshev: [C: 1] Add cirrussearch settings for wikibase (1.5/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/442317 (https://phabricator.wikimedia.org/T182717) (owner: DCausse)
[03:14:13] (CR) Smalyshev: [C: 1] Add cirrussearch settings for wikibase (2/3) (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/442318 (https://phabricator.wikimedia.org/T182717) (owner: DCausse)
[03:14:30] (CR) Smalyshev: [C: 1] Add cirrussearch settings for wikibase (3/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/441057 (https://phabricator.wikimedia.org/T182717) (owner: DCausse)
[04:35:21] twentyafterfour: Can you try again for T198367
[04:35:21] T198367: Mysql Access denied to 'phadmin'@'10.64.0.198' - https://phabricator.wikimedia.org/T198367
[04:35:21] ?
[04:42:17] (PS1) Marostegui: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/442758 (https://phabricator.wikimedia.org/T191316)
[04:43:34] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1099:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/442758 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[04:44:43] (Merged) jenkins-bot: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/442758 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[04:45:58] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1099:3318 for alter table (duration: 00m 59s)
[04:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:19] !log Deploy schema change on db1099:3318 T191316 T192926 T89737 T195193
[04:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:23] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[04:46:23] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[04:46:23] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[04:46:24] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[04:48:24] (CR) jenkins-bot: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/442758 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:30:06] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: Investigate HTTP 500 on POST request to WDQS - https://phabricator.wikimedia.org/T198055#4320949 (Smalyshev) p:Triage>Normal
[05:30:16] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: Enable async logging on Wikidata Query Service - https://phabricator.wikimedia.org/T198051#4320950 (Smalyshev) p:Triage>Normal
[05:49:11] (PS2) ArielGlenn: use iohandlers for recompressxml input and output [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/441485
[05:49:13] (PS1) ArielGlenn: option to skip siteinfo header, mw footer for recompresing files [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/442774
[05:49:15] (PS1) ArielGlenn: options for writeuptopageid to skip writing header or footer [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/442775
[06:27:57] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]
[06:29:18] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/vim/vimrc.local]
[06:30:07] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssh/userkeys/root.d/labstore]
[06:30:28] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/gen_fingerprints]
[06:55:28] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:55:48] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:56:09] (PS2) Muehlenhoff: Add trusty-wikimedia to known-dists [puppet] - https://gerrit.wikimedia.org/r/442325
[06:58:27] RECOVERY - puppet last run on ms-be1027 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:59:20] (CR) Muehlenhoff: [C: 2] Add trusty-wikimedia to known-dists [puppet] - https://gerrit.wikimedia.org/r/442325 (owner: Muehlenhoff)
[06:59:48] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:12:12] !log upload piwik 3.2.1 to jessie-wikimedia
[07:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:42] Operations, ops-eqiad: Heating alerts for mw servers in eqiad - https://phabricator.wikimedia.org/T149287#4321016 (MoritzMuehlenhoff) @Cmjohnson Thanks, sounds good.
[07:30:05] (PS1) Joal: Remove ORM jar from sqoop cron command [puppet] - https://gerrit.wikimedia.org/r/442780 (https://phabricator.wikimedia.org/T196912)
[07:30:15] elukey: --^ please :)
[07:30:56] (CR) Elukey: [C: 2] Remove ORM jar from sqoop cron command [puppet] - https://gerrit.wikimedia.org/r/442780 (https://phabricator.wikimedia.org/T196912) (owner: Joal)
[07:33:24] (PS1) Muehlenhoff: Add tarrow to LDAP users list [puppet] - https://gerrit.wikimedia.org/r/442781 (https://phabricator.wikimedia.org/T196434)
[07:34:17] (PS2) Muehlenhoff: Add tarrow to LDAP users list [puppet] - https://gerrit.wikimedia.org/r/442781 (https://phabricator.wikimedia.org/T196434)
[07:34:57] (CR) Muehlenhoff: [C: 2] Add tarrow to LDAP users list [puppet] - https://gerrit.wikimedia.org/r/442781 (https://phabricator.wikimedia.org/T196434) (owner: Muehlenhoff)
[07:38:38] (CR) Tarrow: Add tarrow to LDAP users list (1 comment) [puppet] - https://gerrit.wikimedia.org/r/442781 (https://phabricator.wikimedia.org/T196434) (owner: Muehlenhoff)
[07:40:23] (CR) Muehlenhoff: [C: 2] Add tarrow to LDAP users list (1 comment) [puppet] - https://gerrit.wikimedia.org/r/442781 (https://phabricator.wikimedia.org/T196434) (owner: Muehlenhoff)
[07:42:26] (PS1) Muehlenhoff: Fix email address for tarrow [puppet] - https://gerrit.wikimedia.org/r/442785
[07:43:14] (CR) Muehlenhoff: [C: 2] Fix email address for tarrow [puppet] - https://gerrit.wikimedia.org/r/442785 (owner: Muehlenhoff)
[08:09:34] !log updating librdkafka1 && restart varnishkafka instances in cache::text nodes - T182993
[08:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:37] T182993: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993
[08:11:54] (PS8) Muehlenhoff: debmonitor: install debmonitor-client [puppet] - https://gerrit.wikimedia.org/r/439641 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[08:15:03] !log restart-hhvm on mw1227 (some threads stuck in jit-related operations, causing high load)
[08:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:24] (CR) Volans: [C: 2] debmonitor: install debmonitor-client [puppet] - https://gerrit.wikimedia.org/r/439641 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[08:16:42] (PS5) ArielGlenn: generate temp stubs for page ranges serially from same input stub file [dumps] - https://gerrit.wikimedia.org/r/436956 (https://phabricator.wikimedia.org/T196063)
[08:18:13] (PS3) Gehel: maps: isolate maps-test2003 and reimage it to stretch [puppet] - https://gerrit.wikimedia.org/r/442258 (https://phabricator.wikimedia.org/T198290)
[08:18:17] (PS4) Gehel: maps: isolate maps-test2003 and reimage it to stretch [puppet] - https://gerrit.wikimedia.org/r/442258 (https://phabricator.wikimedia.org/T198290)
[08:19:29] (CR) Gehel: [C: 2] maps: isolate maps-test2003 and reimage it to stretch [puppet] - https://gerrit.wikimedia.org/r/442258 (https://phabricator.wikimedia.org/T198290) (owner: Gehel)
[08:32:14] (PS1) Jcrespo: mariadb: Depool es1016 for reimage to stretch [mediawiki-config] - https://gerrit.wikimedia.org/r/442792
[08:35:34] (CR) Jcrespo: [C: 2] mariadb: Depool es1016 for reimage to stretch [mediawiki-config] - https://gerrit.wikimedia.org/r/442792 (owner: Jcrespo)
[08:36:48] (Merged) jenkins-bot: mariadb: Depool es1016 for reimage to stretch [mediawiki-config] - https://gerrit.wikimedia.org/r/442792 (owner: Jcrespo)
[08:37:44] (PS1) Volans: debmonitor: fix trusty crontab redirection [puppet] - https://gerrit.wikimedia.org/r/442793 (https://phabricator.wikimedia.org/T191300)
[08:38:15] (CR) jenkins-bot: mariadb: Depool es1016 for reimage to stretch [mediawiki-config] - https://gerrit.wikimedia.org/r/442792 (owner: Jcrespo)
[08:39:42] (CR) Volans: [C: 2] debmonitor: fix trusty crontab redirection [puppet] - https://gerrit.wikimedia.org/r/442793 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[08:40:32] (PS2) Vgutierrez: varnishkafka: Enable TLS signature algorithms and curves lists config [puppet] - https://gerrit.wikimedia.org/r/440544 (https://phabricator.wikimedia.org/T182993)
[08:40:34] (PS1) Vgutierrez: varnishkafka: Set TLS curves list and sigalgs list for cache::misc [puppet] - https://gerrit.wikimedia.org/r/442794 (https://phabricator.wikimedia.org/T182993)
[08:41:44] (PS1) Jcrespo: mariadb: Prepare for reimage of es1016 to stretch [puppet] - https://gerrit.wikimedia.org/r/442795
[08:46:04] !log aborrero@labtestnet2001:~ 7s 130 $ sudo service nova-spiceproxy stop # daemon in infinite respawning loop
[08:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:41] (CR) Vgutierrez: "pcc shows (mostly) no changes: https://puppet-compiler.wmflabs.org/compiler02/11600/" [puppet] - https://gerrit.wikimedia.org/r/440544 (https://phabricator.wikimedia.org/T182993) (owner: Vgutierrez)
[08:46:44] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool es1016 (duration: 01m 04s)
[08:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:40] (PS1) KartikMistry: lttoolbox: New upstream release [debs/contenttranslation/lttoolbox] - https://gerrit.wikimedia.org/r/442798 (https://phabricator.wikimedia.org/T197559)
[08:47:49] (CR) jerkins-bot: [V: -1] lttoolbox: New upstream release [debs/contenttranslation/lttoolbox] - https://gerrit.wikimedia.org/r/442798 (https://phabricator.wikimedia.org/T197559) (owner: KartikMistry)
[08:48:44] (CR) KartikMistry: "recheck" [debs/contenttranslation/lttoolbox] - https://gerrit.wikimedia.org/r/442798 (https://phabricator.wikimedia.org/T197559) (owner: KartikMistry)
[08:50:15] (PS2) DCausse: Add cirrussearch settings for wikibase (1.5/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/442317 (https://phabricator.wikimedia.org/T182717)
[08:50:17] (PS2) DCausse: Add cirrussearch settings for wikibase (2/3) (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/442318 (https://phabricator.wikimedia.org/T182717)
[08:50:19] (PS9) DCausse: Add cirrussearch settings for wikibase (3/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/441057 (https://phabricator.wikimedia.org/T182717)
[08:50:30] (CR) Vgutierrez: "pcc shows no changes in upload and text nodes, and the expected changes in misc: https://puppet-compiler.wmflabs.org/compiler02/11601/" [puppet] - https://gerrit.wikimedia.org/r/442794 (https://phabricator.wikimedia.org/T182993) (owner: Vgutierrez)
[08:51:43] (CR) Elukey: [C: 1] varnishkafka: Enable TLS signature algorithms and curves lists config [puppet] - https://gerrit.wikimedia.org/r/440544 (https://phabricator.wikimedia.org/T182993) (owner: Vgutierrez)
[08:52:26] (CR) Elukey: [C: 1] varnishkafka: Set TLS curves list and sigalgs list for cache::misc [puppet] - https://gerrit.wikimedia.org/r/442794 (https://phabricator.wikimedia.org/T182993) (owner: Vgutierrez)
[08:54:35] (PS3) Vgutierrez: varnishkafka: Enable TLS signature algorithms and curves lists config [puppet] - https://gerrit.wikimedia.org/r/440544 (https://phabricator.wikimedia.org/T182993)
[08:55:02] (CR) Vgutierrez: [C: 2] varnishkafka: Enable TLS signature algorithms and curves lists config [puppet] - https://gerrit.wikimedia.org/r/440544 (https://phabricator.wikimedia.org/T182993) (owner: Vgutierrez)
[08:58:06] (PS2) Vgutierrez: varnishkafka: Set TLS curves list and sigalgs list for cache::misc [puppet] - https://gerrit.wikimedia.org/r/442794 (https://phabricator.wikimedia.org/T182993)
[08:58:22] (CR) Vgutierrez: [C: 2] varnishkafka: Set TLS curves list and sigalgs list for cache::misc [puppet] - https://gerrit.wikimedia.org/r/442794 (https://phabricator.wikimedia.org/T182993) (owner: Vgutierrez)
[09:00:22] (CR) Gehel: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/442317 (https://phabricator.wikimedia.org/T182717) (owner: DCausse)
[09:07:22] Operations, netops: Allow labnet/labnodepool/labvirt to connect to debmonitor hosts/443 - https://phabricator.wikimedia.org/T198375#4321131 (MoritzMuehlenhoff)
[09:10:56] !log Apply new TLS varnishkafka settings in cache::misc nodes - T182993
[09:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:58] T182993: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993
[09:13:37] !log mobrovac@deploy1001 Started deploy [proton/deploy@8a887b5]: Update to dceaf80 - T186748
[09:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:40] T186748: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748
[09:14:06] !log mobrovac@deploy1001 Finished deploy [proton/deploy@8a887b5]: Update to dceaf80 - T186748 (duration: 00m 28s)
[09:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:09] (PS1) Elukey: Upgrade the piwik module to matomo [puppet] - https://gerrit.wikimedia.org/r/442806 (https://phabricator.wikimedia.org/T192298)
[09:16:45] yep s/piwik/matomo
[09:16:58] https://matomo.org/
[09:20:16] !log T198377 stop nova-spiceproxy daemon in labcontrol1002.wikimedia.org
[09:20:17] (PS2) Elukey: Upgrade the piwik module to matomo [puppet] - https://gerrit.wikimedia.org/r/442806 (https://phabricator.wikimedia.org/T192298)
[09:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:18] T198377: nova-spiceproxy is in an infinite respawning loop - https://phabricator.wikimedia.org/T198377
[09:24:51] Hi ops-team - Little ping about analytics deploying AQS (elukey knows)
[09:25:01] (PS1) Arturo Borrero Gonzalez: openstack: eqiad1: keystone bootstrap done [puppet] - https://gerrit.wikimedia.org/r/442807 (https://phabricator.wikimedia.org/T196633)
[09:25:44] (CR) Arturo Borrero Gonzalez: [C: 2] openstack: eqiad1: keystone bootstrap done [puppet] - https://gerrit.wikimedia.org/r/442807 (https://phabricator.wikimedia.org/T196633) (owner: Arturo Borrero Gonzalez)
[09:26:05] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11603/bohrium.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/442806 (https://phabricator.wikimedia.org/T192298) (owner: Elukey)
[09:28:01] !log joal@deploy1001 Started deploy [analytics/aqs/deploy@194ca96]: Deploying AQS pageviews-per-country ceiling-value glue code
[09:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:04] !log joal@deploy1001 Finished deploy [analytics/aqs/deploy@194ca96]: Deploying AQS pageviews-per-country ceiling-value glue code (duration: 01m 03s)
[09:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:25] (CR) Alexandros Kosiaris: [C: 1] debmonitor: fine-tune client user creation [puppet] - https://gerrit.wikimedia.org/r/442246 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[09:45:52] !log joal@deploy1001 Started deploy [analytics/aqs/deploy@8eef2a9]: Deploying AQS pageviews-per-country ceiling-value glue code - Corrected
[09:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:15] (PS1) Arturo Borrero Gonzalez: hieradata: add profile::openstack::eqiad1::neutron::db_pass [labs/private] - https://gerrit.wikimedia.org/r/442814 (https://phabricator.wikimedia.org/T196633)
[09:46:39] (CR) Arturo Borrero Gonzalez: [C: 2] hieradata: add profile::openstack::eqiad1::neutron::db_pass [labs/private] - https://gerrit.wikimedia.org/r/442814 (https://phabricator.wikimedia.org/T196633) (owner: Arturo Borrero Gonzalez)
[09:46:42] (CR) Arturo Borrero Gonzalez: [V: 2 C: 2] hieradata: add profile::openstack::eqiad1::neutron::db_pass [labs/private] - https://gerrit.wikimedia.org/r/442814 (https://phabricator.wikimedia.org/T196633) (owner: Arturo Borrero Gonzalez)
[09:48:40] !log joal@deploy1001 Finished deploy [analytics/aqs/deploy@8eef2a9]: Deploying AQS pageviews-per-country ceiling-value glue code - Corrected (duration: 02m 48s)
[09:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:44] (PS1) Arturo Borrero Gonzalez: openstack: eqiad1: enable neutron in control boxes [puppet] - https://gerrit.wikimedia.org/r/442815 (https://phabricator.wikimedia.org/T196633)
[09:52:50] (CR) Filippo Giunchedi: [C: 1] cassandra: add another package version to the 2.2 list [puppet] - https://gerrit.wikimedia.org/r/442251 (https://phabricator.wikimedia.org/T197062) (owner: Elukey)
[09:53:21] (CR) Arturo Borrero Gonzalez: [C: 2] openstack: eqiad1: enable neutron in control boxes [puppet] - https://gerrit.wikimedia.org/r/442815 (https://phabricator.wikimedia.org/T196633) (owner: Arturo Borrero Gonzalez)
[09:53:40] (CR) Arturo Borrero Gonzalez: [V: 2 C: 2] "Compiler says this is OK:" [puppet] - https://gerrit.wikimedia.org/r/442815 (https://phabricator.wikimedia.org/T196633) (owner: Arturo Borrero Gonzalez)
[09:58:13] !log Resuming deployment of phabricator upgrade tagged release/2018-06-27/1 - details: https://phabricator.wikimedia.org/project/profile/3439/ )
[09:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:57] !log installing reportbug update from jessie 8.11 point release
[10:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:06] !log running phabricator database migration
[10:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:45] Hi again ops-team - Analytics deploy hadoop related scripts - No impact expected on wiki side
[10:10:19] !log joal@deploy1001 Started deploy [analytics/refinery@4fc20a5]: Regular weekly deploy
[10:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:55] !log phabricator database migration complete, service restored and appears stable.
[10:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:17] twentyafterfour: is phab meant to be complete?
[10:13:22] https://phabricator.wikimedia.org/T198341 PhabricatorDataNotAttachedException
[10:13:47] (CR) 20after4: [C: 1] "I think this should be ready to merge." [puppet] - https://gerrit.wikimedia.org/r/441525 (https://phabricator.wikimedia.org/T197922) (owner: 20after4)
[10:14:16] addshore: hmm that's not right
[10:14:21] :D
[10:14:44] addshore: strangely, every other task I've tried so far was fine
[10:14:59] twentyafterfour: also https://phabricator.wikimedia.org/T198360
[10:15:24] and https://phabricator.wikimedia.org/T136528 D:, in fact, the only 3 tasks I have tried to load have failed :D
[10:15:42] I'm also getting PhabricatorDataNotAttachedException. the person across from me claims everything's working for him :|
[10:15:49] hi, i've just come to report the same thing, presumably
[10:16:02] it looks like the favicon also changed?
[10:16:05] i can't view some tasks when logged in, but they work fine in incognito window
[10:16:22] ooh, yes, incog works for me too
[10:16:24] phab is down
[10:16:26] ?
[10:16:29] ok
[10:16:32] Request from xxxxx via cp1061 cp1061, Varnish XID 24768343
[10:16:32] Error: 503, Backend fetch failed at Thu, 28 Jun 2018 10:16:18 GMT
[10:16:36] on Phab
[10:16:59] and now
[10:17:00] PhabricatorDataNotAttachedException
[10:17:01] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:17:02] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:17:06] hmm I'm not sure what's up with it, I'm working on it
[10:17:20] it also says: "Attempting to access attached data on PhabricatorProject, but the data is not actually attached. Before accessing attachable data on an object, you must load and attach it.
[10:17:20] Data is normally attached by calling the corresponding needX() method on the Query class when the object is loaded. You can also call the corresponding attachX() method explicitly."
[10:17:40] cleared cookies, logged out and back in and still get the exceptions
[10:18:41] !log joal@deploy1001 Finished deploy [analytics/refinery@4fc20a5]: Regular weekly deploy (duration: 08m 21s)
[10:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:09] ok the data not attached error is fixed
[10:19:15] twentyafterfour: looks fixed to me
[10:19:18] thanks!
[10:19:22] here too. cool!
[10:19:34] <_joe_> twentyafterfour: what was the problem?
[10:19:47] !log hotfixing phabricator DataNotAttached bug
[10:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:55] _joe_: a bug in the code I just deployed
[10:20:10] wfm atm
[10:21:21] <_joe_> heh
[10:21:26] I'm not sure why it only happens on some tasks and not others. Or why it doesn't happen on my test instance
[10:22:41] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:23:41] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:48:38] !log deployed fix for PhabricatorDataNotAttachedException - https://phabricator.wikimedia.org/rPHEX03971ea8965d3613df69833a766d1502b6d8dabb
[10:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:34] (PS1) ArielGlenn: generate multiple temp stub files at once for larger wikis [dumps] - https://gerrit.wikimedia.org/r/442828 (https://phabricator.wikimedia.org/T196063)
[11:34:54] (CR) jerkins-bot: [V: -1] generate multiple temp stub files at once for larger wikis [dumps] - https://gerrit.wikimedia.org/r/442828 (https://phabricator.wikimedia.org/T196063) (owner: ArielGlenn)
[12:00:11] (PS1) 20after4: Phabricator: Use mysqlnd [puppet] - https://gerrit.wikimedia.org/r/442829
[12:02:03] (PS2) ArielGlenn: generate multiple temp stub files at once for larger wikis [dumps] - https://gerrit.wikimedia.org/r/442828 (https://phabricator.wikimedia.org/T196063)
[12:04:42] !log installing patch security updates
[12:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:29] Operations, ops-eqiad, Analytics, User-Elukey: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T197707 (Cmjohnson) The disk has been swapped with a 2TB disk.
[12:07:59] Operations, Traffic, Goal: Establish timeline and methodology for upcoming deprecation of non-forward-secret ciphers and TLSv1.0 - https://phabricator.wikimedia.org/T192559 (Vgutierrez) Our [[ https://grafana.wikimedia.org/dashboard/db/tls-ciphersuite-explorer?panelId=2&fullscreen&orgId=1&from=now-30...
[12:08:38] !log akosiaris@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,service=.*,cluster=scb,name=scb1002
[12:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:21] PROBLEM - Host scb1002 is DOWN: PING CRITICAL - Packet loss = 100%
[12:13:11] ACKNOWLEDGEMENT - Host scb1002 is DOWN: PING CRITICAL - Packet loss = 100% alexandros kosiaris memory dimm issue. https://phabricator.wikimedia.org/T196901
[12:13:53] Operations, Discovery, Discovery-Search: migrate elasticsearch cirrus cluster to RAID0 - https://phabricator.wikimedia.org/T198391 (Gehel)
[12:14:47] twentyafterfour: I am seeing:
[12:14:49] 17 notifications about objects which no longer exist or which you can no longer see were discarded.
[12:14:59] That is new and I have never seen that
[12:16:21] Hmm seems to have gone but it’s now showing only three recent notifications (otherwise I have to click to view all notifications)
[12:23:51] RECOVERY - Host scb1002 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[12:24:30] Operations, Puppet, puppet-compiler, User-herron: Upgrade Puppet compilers to Stretch - https://phabricator.wikimedia.org/T191438 (aborrero)
[12:30:40] (PS1) Alexandros Kosiaris: grafana: Name the grafana-admin key correctly [puppet] - https://gerrit.wikimedia.org/r/442835
[12:34:33] (PS1) Arturo Borrero Gonzalez: openstack: eqiad1: enable glance in control boxes [puppet] - https://gerrit.wikimedia.org/r/442836 (https://phabricator.wikimedia.org/T196633)
[12:36:37] (PS1) Arturo Borrero Gonzalez: hieradata: add profile::openstack::eqiad1::glance::db_pass [labs/private] - https://gerrit.wikimedia.org/r/442837 (https://phabricator.wikimedia.org/T196633)
[12:37:25] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=.*,cluster=scb,name=scb1002
[12:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:42] !log repool scb1002 T196901
[12:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:45] T196901: Replace memory bank on scb1002 - https://phabricator.wikimedia.org/T196901
[12:38:11] (CR) Arturo Borrero Gonzalez: [V: 2 C: 2] hieradata: add profile::openstack::eqiad1::glance::db_pass [labs/private] - https://gerrit.wikimedia.org/r/442837 (https://phabricator.wikimedia.org/T196633) (owner: Arturo Borrero Gonzalez)
[12:39:53] (CR) Alexandros Kosiaris: [C: 2] grafana: Name the grafana-admin key correctly [puppet] - https://gerrit.wikimedia.org/r/442835 (owner: Alexandros Kosiaris)
[12:40:30] (CR) Rush: [C: 2] openstack: eqiad1: enable glance in control boxes [puppet] - https://gerrit.wikimedia.org/r/442836 (https://phabricator.wikimedia.org/T196633) (owner: Arturo Borrero Gonzalez)
[12:41:34] (CR) Rush: [C: 2] "fyi https://gerrit.wikimedia.org/r/c/operations/puppet/+/440147" [puppet] - https://gerrit.wikimedia.org/r/442836 (https://phabricator.wikimedia.org/T196633) (owner: Arturo Borrero Gonzalez)
[12:41:48] (CR) Arturo Borrero Gonzalez: [V: 2] "Compiler is happy:" [puppet] - https://gerrit.wikimedia.org/r/442836 (https://phabricator.wikimedia.org/T196633) (owner: Arturo Borrero Gonzalez)
[12:46:26] (PS2) Arturo Borrero Gonzalez: openstack: eqiad1: enable neutron in control boxes [puppet] - https://gerrit.wikimedia.org/r/442815 (https://phabricator.wikimedia.org/T196633)
[12:49:10] !log stop hadoop daemons on analytics1032 + shutdown to swap BBU -T194234
[12:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:12] T194234: anaytics1032's BBU is not working correctly - https://phabricator.wikimedia.org/T194234
[12:49:15] (PS3) Filippo Giunchedi: WIP grafana: host overview dashboard as code [puppet] - https://gerrit.wikimedia.org/r/442301 (https://phabricator.wikimedia.org/T178690)
[12:49:42] (CR) jerkins-bot: [V: -1] WIP grafana: host overview dashboard as code [puppet] - https://gerrit.wikimedia.org/r/442301 (https://phabricator.wikimedia.org/T178690) (owner: Filippo Giunchedi)
[12:51:27] PROBLEM - puppet last run on labcontrol1004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 40 seconds ago with 2 failures. Failed resources (up to 3 shown): Package[neutron-common],File[/etc/neutron/original]
[12:53:18] !log installing blktrace update from jessie 8.11 point release
[12:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:30] (PS1) Rush: labstore: notes in nfs-manage for failover [puppet] - https://gerrit.wikimedia.org/r/442838 (https://phabricator.wikimedia.org/T157478)
[12:53:37] (PS1) Arturo Borrero Gonzalez: openstack: neutron-common: install from jessie-backports if running mitaka [puppet] - https://gerrit.wikimedia.org/r/442839 (https://phabricator.wikimedia.org/T196633)
[12:53:58] jouncebot: next
[12:53:58] In 0 hour(s) and 6 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180628T1300)
[12:54:25] (CR) Rush: [C: 2] labstore: notes in nfs-manage for failover [puppet] - https://gerrit.wikimedia.org/r/442838 (https://phabricator.wikimedia.org/T157478) (owner: Rush)
[12:55:07] (CR) Arturo Borrero Gonzalez: [C: 2] openstack: neutron-common: install from jessie-backports if running mitaka [puppet] - https://gerrit.wikimedia.org/r/442839 (https://phabricator.wikimedia.org/T196633) (owner: Arturo Borrero Gonzalez)
[12:55:20] (PS2) Arturo Borrero Gonzalez: openstack: neutron-common: install from jessie-backports if running mitaka [puppet] - https://gerrit.wikimedia.org/r/442839 (https://phabricator.wikimedia.org/T196633)
[13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180628T1300).
[13:00:04] raynor, MatmaRex, and dcausse: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] !log installing bwm-ng update from jessie 8.11 point release
[13:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:17] hi
[13:00:50] present
[13:01:19] o/
[13:01:31] o/
[13:01:43] raynor and dcausse: you are deployers, right? want to deploy your own commits?
[13:01:47] RECOVERY - puppet last run on labcontrol1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[13:01:49] !log upload matomo (new Piwik) 3.5.1-1 to jessie-wikimedia
[13:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:58] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:02:01] zeljkof: sure I can
[13:02:04] yup, I can do that
[13:02:14] I can go last as I don't have too much experience yet
[13:02:22] and definitely it will take me the longest ;)
[13:02:27] PROBLEM - Host analytics1032.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:02:36] raynor and dcausse: anybody wants to deploy MatmaRex's patch too?
:) [13:02:47] (I can do it, just asking) [13:02:57] mine can take some time but I can go first if noone objects [13:02:59] I'll watch ;) [13:03:02] I can deploy MatmaRex one [13:03:47] ok, then we are all set, dcausse you are the main swatter today, let MatmaRex know when you are deploying his patch, and let raynor know when it's his turn :) [13:04:00] I am around if anybody needs me [13:04:09] MatmaRex: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/442832 got -1 from jenkins [13:04:35] dcausse: it should be harmless, CI has been having some issues since yesterday and occasionally jobs time out [13:04:44] 10Operations, 10Traffic, 10Goal: Establish timeline and methodology for upcoming deprecation of non-forward-secret ciphers and TLSv1.0 - https://phabricator.wikimedia.org/T192559 (10BBlack) Going a bit beyond the explicit scope of this ticket, there are really a few different legacy-support risks we'd like t... [13:04:49] note how it took exactly 30 minutes to fail [13:05:05] (bad news is, we might be waiting a long time for changes to merge) [13:05:12] :/ [13:05:33] (see https://phabricator.wikimedia.org/T198348) [13:06:03] so what's the plan C+2/V+2 ? 
[13:06:26] (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable glance in control boxes [puppet] - 10https://gerrit.wikimedia.org/r/442836 (https://phabricator.wikimedia.org/T196633) [13:06:28] (03PS1) 10Vgutierrez: varnishkafka: Set TLS curves list and sigalgs list for cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/442840 (https://phabricator.wikimedia.org/T182993) [13:06:43] just C+2, it will run the tests again, and hopefully they'll pass [13:06:58] ok will C+2 and deploy my patches in the meantime [13:07:16] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: eqiad1: enable glance in control boxes [puppet] - 10https://gerrit.wikimedia.org/r/442836 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [13:07:18] (03CR) 10Elukey: [C: 031] varnishkafka: Set TLS curves list and sigalgs list for cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/442840 (https://phabricator.wikimedia.org/T182993) (owner: 10Vgutierrez) [13:07:20] and if it takes more than 10 minutes or so, then yeah, you'll have to V+2 and merge, i guess [13:07:34] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442317 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [13:07:38] RECOVERY - Host analytics1032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.52 ms [13:07:59] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T197707 (10Marostegui) ``` root@dbstore1002:~# megacli -PDRbld -ShowProg -PhysDrv [32:5] -aALL Rebuild Progress on Device at Enclosure 32, Slot 5 Completed 1% in 63 Minutes. ``` [13:08:31] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T197707 (10elukey) Thanks @Marostegui ! 
[13:08:37] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational [13:08:48] (03Merged) 10jenkins-bot: Add cirrussearch settings for wikibase (1.5/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442317 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [13:09:04] (03CR) 10jenkins-bot: Add cirrussearch settings for wikibase (1.5/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442317 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [13:10:14] 10Operations, 10ops-eqiad: anaytics1032's BBU is not working correctly - https://phabricator.wikimedia.org/T194234 (10elukey) Looks good! ``` elukey@analytics1032:~$ sudo megacli -AdpBbuCmd -GetBbuStatus -aALL BBU status for Adapter: 0 BatteryType: BBU Voltage: 3966 mV Current: 161 mA Temperature: 40 C Batt... [13:10:52] (03CR) 10Vgutierrez: [C: 032] varnishkafka: Set TLS curves list and sigalgs list for cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/442840 (https://phabricator.wikimedia.org/T182993) (owner: 10Vgutierrez) [13:11:10] (03PS2) 10Vgutierrez: varnishkafka: Set TLS curves list and sigalgs list for cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/442840 (https://phabricator.wikimedia.org/T182993) [13:11:11] 10Operations, 10ops-eqiad: anaytics1032's BBU is not working correctly - https://phabricator.wikimedia.org/T194234 (10elukey) 05Open>03Resolved [13:11:19] (03PS1) 10Arturo Borrero Gonzalez: openstack: glance: install from jessie-backports if running mitaka [puppet] - 10https://gerrit.wikimedia.org/r/442841 (https://phabricator.wikimedia.org/T196633) [13:11:48] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:12:08] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: glance: install from jessie-backports if running mitaka [puppet] - 10https://gerrit.wikimedia.org/r/442841 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [13:12:19] (03PS2) 10Arturo Borrero Gonzalez: openstack: glance: install from jessie-backports if running mitaka [puppet] - 10https://gerrit.wikimedia.org/r/442841 (https://phabricator.wikimedia.org/T196633) [13:13:00] !log dcausse@deploy1001 Synchronized ./wmf-config/WikibaseSearchSettings.php: Add cirrussearch settings for wikibase (1.5/3) (duration: 00m 56s) [13:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:12] deploying my second patch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/442318/ [13:13:29] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442318 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [13:14:17] PROBLEM - puppet last run on labcontrol1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[glance] [13:14:31] !log Apply new TLS varnishkafka settings in cache::upload nodes - T182993 [13:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:33] T182993: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993 [13:14:39] (03Merged) 10jenkins-bot: Add cirrussearch settings for wikibase (2/3) (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442318 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [13:16:18] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational [13:18:56] !log dcausse@deploy1001 Synchronized ./wmf-config/: Add cirrussearch settings for wikibase (2/3) (take 2) (duration: 00m 58s) [13:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:18] RECOVERY - puppet last run on labcontrol1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:19:22] (03CR) 10jenkins-bot: Add cirrussearch settings for wikibase (2/3) (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442318 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [13:19:34] deploying my third patch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/441057/ [13:19:38] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:19:53] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441057 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [13:20:49] (03PS4) 10Elukey: cassandra: add another package version to the 2.2 list [puppet] - 10https://gerrit.wikimedia.org/r/442251 (https://phabricator.wikimedia.org/T197062) [13:21:05] (03Merged) 10jenkins-bot: Add cirrussearch settings for wikibase (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441057 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [13:21:19] (03CR) 10jerkins-bot: [V: 04-1] cassandra: add another package version to the 2.2 list [puppet] - 10https://gerrit.wikimedia.org/r/442251 (https://phabricator.wikimedia.org/T197062) (owner: 10Elukey) [13:22:52] (03CR) 10Ottomata: ":D" [puppet] - 10https://gerrit.wikimedia.org/r/440544 (https://phabricator.wikimedia.org/T182993) (owner: 10Vgutierrez) [13:24:11] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993 (10Ottomata) Woo hoo! Annnnd soon we disable IPSec?! 
:D [13:25:02] (03CR) 10jenkins-bot: Add cirrussearch settings for wikibase (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441057 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [13:25:06] !log dcausse@deploy1001 Synchronized ./wmf-config/Wikibase-production.php: Add cirrussearch settings for wikibase (3/3) (duration: 00m 56s) [13:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:25] ok I'm done with my patches [13:25:46] still waiting for CI on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/442832 :( [13:29:34] (03CR) 10Elukey: [V: 032 C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11608/" [puppet] - 10https://gerrit.wikimedia.org/r/442251 (https://phabricator.wikimedia.org/T197062) (owner: 10Elukey) [13:31:35] it took forever to get going, but it's actually running tests now [13:32:18] ok [13:36:15] 10Operations, 10ops-eqiad, 10User-Elukey, 10User-Joe: rack/setup/install rdb10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T196685 (10Cmjohnson) [13:38:06] dcausse: ugh, well, it failed due to timing out [13:38:12] :( [13:38:36] 10Operations, 10SRE-Access-Requests: WMF-NDA-Request for User:Braveheart - https://phabricator.wikimedia.org/T198190 (10Braveheart) Hi Nuria! I'd love to look at the geoeditor reports - when do you expect them to be published? I assume I still need LDAP access for these datasets? Best, Philip [13:38:56] zeljkof: is it OK to C+2/V+2 when jenkins is timing out and the patch looks harmless? [13:39:16] dcausse: it's up to deployed to decide :D [13:39:28] meh :) [13:39:42] dcausse: if you are reasonable sure it will not break stuff and please monitor the logs for at least a few minutes after the deploy [13:39:43] dcausse: haha, actually, it looks like different jobs timed out in the "Main test build" and "Gate pipeline build". 
so it does actually pass them all, at least sometimes ;) [13:43:29] MatmaRex: it's live on mwdebug1002 [13:43:56] ok, testing [13:44:37] PROBLEM - puppet last run on maps-test2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cassandra] [13:46:04] dcausse: looks fine! [13:46:11] MatmaRex: ok deploying [13:46:11] (03PS5) 10Muehlenhoff: Enable microcode for all database roles [puppet] - 10https://gerrit.wikimedia.org/r/442269 (https://phabricator.wikimedia.org/T127825) [13:46:27] (the page took like a minute to load the first time) [13:48:01] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442852 [13:48:07] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442852 [13:48:08] (03CR) 10Muehlenhoff: [C: 032] Enable microcode for all database roles [puppet] - 10https://gerrit.wikimedia.org/r/442269 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff) [13:49:14] !log downgrade cassadra and cassandra-tools from 2.2.6-wmf5 to 2.2.6-wmf3 in jessie-wikimedia component/cassandra22 - T197062 [13:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:17] T197062: Upgrade Cassandra on AQS to 2.2.6-wmf5 - https://phabricator.wikimedia.org/T197062 [13:49:23] !log dcausse@deploy1001 Synchronized ./php-1.32.0-wmf.10/includes/htmlform/: Allow overloading of getLabel() with return ' ' (duration: 00m 59s) [13:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:32] MatmaRex: done [13:49:48] raynor: I'm done [13:49:54] sorry for the delay :( [13:50:18] thanks. looks good in production [13:50:57] no worries [13:51:34] dcausse: Is swat done? 
[13:51:44] nope, I need to swat one more thing [13:51:44] marostegui: raynor has one more patch to submit [13:51:50] (03PS1) 10Rush: labstore: nfs-mount-manager add list, all, and refine help [puppet] - 10https://gerrit.wikimedia.org/r/442853 [13:51:56] Ah cool :) [13:52:02] swatting https://gerrit.wikimedia.org/r/#/c/442170/ [13:53:57] (03PS2) 10Rush: labstore: nfs-mount-manager add list, all, and refine help [puppet] - 10https://gerrit.wikimedia.org/r/442853 [13:58:59] (03PS3) 10Rush: labstore: nfs-mount-manager add list, all, and refine help [puppet] - 10https://gerrit.wikimedia.org/r/442853 [13:59:37] it's soo slow ;/ [14:00:26] raynor: yes CI is struggling :(, I had to force merge the last patch [14:01:35] (03CR) 10Rush: [C: 032] labstore: nfs-mount-manager add list, all, and refine help [puppet] - 10https://gerrit.wikimedia.org/r/442853 (owner: 10Rush) [14:03:01] zeljkof: https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/6238/console >> 13:55:14 npm ERR! registry error parsing json [14:03:08] is it something we should worry? [14:04:07] raynor: probably! I think hashar is working on it https://phabricator.wikimedia.org/T198348 [14:05:05] ok, zeljkof I think I need some help ;/ [14:05:41] I merged the patch and I don't see it on deploy1001 - the patch is to merge to wmf/1.32.0-wmf.10 [14:05:56] not master, maybe because of that I don't see it if I do `git fetch` ? [14:06:10] !log downgrading cassandra to 2.2.6-wmf3 on maps-test2001 (it should never have been upgraded) [14:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:29] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993 (10Vgutierrez) >>! In T182993#4321845, @Ottomata wrote: > Woo hoo! > > Annnnd soon we disable IPSec?! :D As soon as we rollout this on cache::... 
[14:06:33] raynor: yes, there is slightly different steps for backports [14:06:46] raynor: looking up docs [14:06:46] also, the change is for Vector skin [14:06:47] not core [14:07:07] 10Operations, 10ops-eqiad, 10Cloud-VPS: rack upgraded storage capacity in labstore100[67].eqiad.wmnet - https://phabricator.wikimedia.org/T196651 (10Cmjohnson) [14:07:10] hm, not sure I've ever done skin, but it should be similar to extension... [14:07:38] yup, I also think it's the same [14:08:23] 10Operations, 10ops-eqiad, 10Cloud-VPS: rack upgraded storage capacity in labstore100[67].eqiad.wmnet - https://phabricator.wikimedia.org/T196651 (10Cmjohnson) i was able to relocate a few servers in d2 to make room for the new disk shelf (LS1007). For LS1006, I just removed 2 decom'd servers from u24 and 25... [14:09:43] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:10:19] raynor: so these are the steps [14:10:20] https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Step_2:_get_the_code_on_the_deployment_host [14:10:32] I do have a simplified version, I think, will create a phab paste [14:11:29] ok, got it, thanks [14:11:34] raynor: so this is what I do https://phabricator.wikimedia.org/P7315 [14:12:02] let me know if you have questions [14:19:42] ok,I have code in vector, code is up to date [14:19:50] now if I do scap pull on mwdebug1002 nothing changes [14:19:57] 10Operations, 10ops-eqiad: mw1239 correctable memory errors - https://phabricator.wikimedia.org/T198398 (10fgiunchedi) [14:20:33] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on mw1239 is CRITICAL: 12 ge 4 Filippo Giunchedi https://phabricator.wikimedia.org/T198398 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad%2520prometheus%252Fops [14:21:12] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational [14:24:32] PROBLEM - Check systemd state 
on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:26:22] !log CI jobs running npm might suffer from a 10 minutes delay since June 27th | T198348 [14:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:25] T198348: CI jobs takes too long / instances overloaded - https://phabricator.wikimedia.org/T198348 [14:29:04] !log installing ghostscript security updates [14:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:39] (03CR) 10Filippo Giunchedi: "I couldn't find a production host that runs striker, according to puppet's manifest comment that ought to be labtestweb2001 but PCC disagr" [puppet] - 10https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) (owner: 10Filippo Giunchedi) [14:31:28] PROBLEM - Host ms-be1036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:32:03] that's known, downtime expired perhaps [14:35:23] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186 (10Cmjohnson) [14:36:40] zeljkof: can I deploy? https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/442852/ [14:37:07] marostegui: just a sec, raynor is finishing up something [14:37:14] Cool! [14:39:43] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install backup1001 - https://phabricator.wikimedia.org/T196478 (10Cmjohnson) [14:39:53] (03PS3) 10Elukey: Upgrade the piwik module to matomo [puppet] - 10https://gerrit.wikimedia.org/r/442806 (https://phabricator.wikimedia.org/T192298) [14:40:03] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install backup1001 - https://phabricator.wikimedia.org/T196478 (10Cmjohnson) disk arrays are racked in D2. 
[14:40:57] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [14:41:44] (03CR) 10Elukey: [C: 032] Upgrade the piwik module to matomo [puppet] - 10https://gerrit.wikimedia.org/r/442806 (https://phabricator.wikimedia.org/T192298) (owner: 10Elukey) [14:41:57] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [14:44:00] marostegui, almost done [14:44:47] great, I will wait for it :) [14:46:07] (03PS1) 10Filippo Giunchedi: statsite: deprecate Diamond udp collector [puppet] - 10https://gerrit.wikimedia.org/r/442865 (https://phabricator.wikimedia.org/T183454) [14:46:40] !log upgrade piwik 3.2.1 to matomo (new name/package) 3.5.1 - T192298 [14:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:42] T192298: Update piwik to latest stable - https://phabricator.wikimedia.org/T192298 [14:46:59] (03PS1) 10Urbanecm: Add sat to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/442867 (https://phabricator.wikimedia.org/T198400) [14:49:57] !log pmiazga@deploy1001 Synchronized php-1.32.0-wmf.10/skins/Vector/components/watchstar.less: SWAT: [[gerrit:442170|Use exactly calculated value to work around a Chrome bug (T196610)]] (duration: 01m 00s) [14:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:59] T196610: Star in tab bar disappears after adding page to watchlist in Chrome - https://phabricator.wikimedia.org/T196610 [14:50:52] !log EU SWAT finished [14:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:07] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational [14:51:16] SWAT is done, sorry for taking so long, I had problems with testing the patch ;/ [14:51:29] marostegui: you can proceed [14:51:34] Thanks [14:52:41] 
(03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442852 (owner: 10Marostegui) [14:52:43] 10Operations, 10ops-eqdfw, 10netops: eqdfw: Patch GTT cross-connect - https://phabricator.wikimedia.org/T194515 (10ayounsi) 05Open>03Resolved The LOA was incorrect, Equinix moved it to the proper one and link is up. [14:53:58] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442852 (owner: 10Marostegui) [14:54:27] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:55:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1099:3318 after alter table (duration: 00m 58s) [14:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:12] (03PS1) 10Marostegui: db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442869 (https://phabricator.wikimedia.org/T191316) [14:58:03] (03PS1) 10Rush: WIP labstore: switch labstore1005 to primary in pair [puppet] - 10https://gerrit.wikimedia.org/r/442870 (https://phabricator.wikimedia.org/T187962) [14:58:42] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442852 (owner: 10Marostegui) [14:59:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442869 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [15:00:06] (03PS1) 10Urbanecm: Initial configuration for satwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442871 (https://phabricator.wikimedia.org/T198400) [15:00:23] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442869 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [15:01:15] 
(03CR) 10jerkins-bot: [V: 04-1] Initial configuration for satwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442871 (https://phabricator.wikimedia.org/T198400) (owner: 10Urbanecm) [15:01:39] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1101:3318 for alter table (duration: 00m 57s) [15:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:53] !log Deploy schema change on db1101:3318 T191316 T192926 T89737 T195193 [15:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:57] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [15:01:57] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [15:01:58] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [15:01:58] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [15:03:47] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442869 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [15:07:17] PROBLEM - puppet last run on labstore1007 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 4 minutes ago with 4 failures. 
Failed resources (up to 3 shown): File[/srv/dumps/xmldatadumps/public/other/mediacounts/readme.html],File[/srv/dumps/xmldatadumps/public/other/pageviews/readme.html],File[/srv/dumps/xmldatadumps/public/other/misc] [15:08:10] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10Papaul) @MoritzMuehlenhoff please see below for the out put you requested {F23057592} {F23057594} [15:10:17] PROBLEM - HP RAID on labstore1007 is CRITICAL: CRITICAL: Slot 1: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK --- Slot 3: bad transfer speed: 1E:2:1(12.0Gbps, Unknown), 1E:2:2(12.0Gbps, Unknown), 1E:2:3(12.0Gbps, Unknown), 1E:2:4(12.0Gbps, Unknown), 1E:2:5(12.0Gbps, Unknown), 1E:2:6(12.0Gbps, Unknown), 1E:2:7(12.0Gbps, [15:10:17] 2.0Gbps, Unknown), 1E:2:9(12.0Gbps, Unknown), 1E:2:10(12.0Gbps, Unknown), 1E:2:11(12.0Gbps, Unknown), 1E:2:12(12.0Gbps, Unknown) - OK: 1E:1:1, 1E:1:3, 1E:1:5, 1E:1:7, 1E:1:9, 1E:1:11, 1E:2:1, 1E:2:2, 1E:2:3, 1E:2:4, 1E:2:5, 1E:2:6, 1E:2:7, 1E:2:8, 1E:2:9, 1E:2:10, 1E:2:11, 1E:2:12 - Failed: 1E:1:2, 1E:1:4, 1E:1:6, 1E:1:8, 1E:1:10, 1E:1:12 - Controller: OK - Battery/Capacitor: OK [15:10:19] ACKNOWLEDGEMENT - HP RAID on labstore1007 is CRITICAL: CRITICAL: Slot 1: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK --- Slot 3: bad transfer speed: 1E:2:1(12.0Gbps, Unknown), 1E:2:2(12.0Gbps, Unknown), 1E:2:3(12.0Gbps, Unknown), 1E:2:4(12.0Gbps, Unknown), 1E:2:5(12.0Gbps, Unknown), 1E:2:6(12.0Gbps, Unknown), 1E:2:7(1 [15:10:20] 1E:2:8(12.0Gbps, Unknown), 1E:2:9(12.0Gbps, Unknown), 1E:2:10(12.0Gbps, Unknown), 1E:2:11(12.0Gbps, Unknown), 1E:2:12(12.0Gbps, Unknown) - OK: 1E:1:1, 1E:1:3, 1E:1:5, 1E:1:7, 1E:1:9, 1E:1:11, 1E:2:1, 1E:2:2, 1E:2:3, 1E:2:4, 1E:2:5, 1E:2:6, 1E:2:7, 1E:2:8, 
1E:2:9, 1E:2:10, 1E:2:11, 1E:2:12 - Failed: 1E:1:2, 1E:1:4, 1E:1:6, 1E:1:8, 1E:1:10, 1E:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https: [15:10:20] edia.org/T198407 [15:10:23] 10Operations, 10SRE-Access-Requests: WMF-NDA-Request for User:Braveheart - https://phabricator.wikimedia.org/T198190 (10Nuria) @Braveheart: yes, you will need LDAP access and a term for which is granted (we do not grant unbounded access) . We have also given out edited versions of those reports w/o LDAP to thi... [15:10:25] 10Operations, 10ops-eqiad: Degraded RAID on labstore1007 - https://phabricator.wikimedia.org/T198407 (10ops-monitoring-bot) [15:11:57] PROBLEM - Device not healthy -SMART- on labstore1006 is CRITICAL: cluster=misc device={cciss,14,cciss,15,cciss,16,cciss,17,cciss,18,cciss,19,cciss,20,cciss,21,cciss,22,cciss,23} instance=labstore1006:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1006&var-datasource=eqiad%2520prometheus%252Fops [15:12:47] 10Operations, 10ops-eqiad, 10DNS, 10Traffic: rack/setup/install authdns1001.wikimedia.org - https://phabricator.wikimedia.org/T196693 (10Cmjohnson) [15:13:04] (03PS2) 10Urbanecm: Initial configuration for satwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442871 (https://phabricator.wikimedia.org/T198400) [15:14:10] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for satwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442871 (https://phabricator.wikimedia.org/T198400) (owner: 10Urbanecm) [15:15:01] 10Operations, 10ops-eqiad: Degraded RAID on labstore1007 - https://phabricator.wikimedia.org/T198407 (10chasemp) a:03Cmjohnson I don't quite understand this. Is this trying to say 6 failed drives? 
[15:15:08] 10Operations, 10ops-eqiad: Degraded RAID on labstore1007 - https://phabricator.wikimedia.org/T198407 (10chasemp) p:05Triage>03High [15:15:46] (03PS3) 10Urbanecm: Initial configuration for satwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442871 (https://phabricator.wikimedia.org/T198400) [15:16:14] 10Operations, 10ops-eqiad: Degraded RAID on labstore1007 - https://phabricator.wikimedia.org/T198407 (10chasemp) and labstore1006 as well? [from irc] ```PROBLEM - Device not healthy -SMART- on labstore1006 is CRITICAL: cluster=misc device={cciss,14,cciss,15,cciss,16,cciss,17,cciss,18,cciss,19,cciss,20,cciss,... [15:16:28] 10Operations, 10ops-eqiad: Degraded RAID on labstore1007 - https://phabricator.wikimedia.org/T198407 (10chasemp) @Volans can you help make sense of this? [15:17:11] chasemp: wow :) [15:17:39] volans: I have a guess that cmjohnson1 is adding new shelves here and it's caushing the raid monitoring to freak out, but I'm really not sure [15:17:47] I have a meeting in 3 fyi [15:18:24] 10Operations, 10ops-eqiad: Degraded RAID on labstore1007 - https://phabricator.wikimedia.org/T198407 (10chasemp) possibly related to {T196651}? [15:18:36] might be, has so many things wrong, if you're mangling with the host the best suggestion is to disable event handler in Icinga [15:18:45] for the HP RAID [15:18:48] check for that host [15:19:09] and re-enable it once done [15:19:19] I'm not doing anything w/ it today but ack I wonder if cmjohnson1 is [15:19:51] for that I've no more info than you ;) [15:20:52] heard [15:21:03] apergos: fyi T198407 and T196651, I'm not sure what's going on [15:21:03] T198407: Degraded RAID on labstore1007 - https://phabricator.wikimedia.org/T198407 [15:21:03] T196651: rack upgraded storage capacity in labstore100[67].eqiad.wmnet - https://phabricator.wikimedia.org/T196651 [15:21:07] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational [15:21:20] chasemp: ? 
[15:21:31] chasemp: Bunch of IO errors on dmesg [15:21:48] crap [15:22:07] Maybe cmjohnson1 pulling out disks [15:22:09] it seems crazy it would hit both servers at the same time unless it was related to connecting new shelves [15:22:19] indeed [15:22:48] that explains the cron email about a read-only filesystem I just got (labstore1007) [15:23:00] (03PS1) 10Volans: Improve validation on host package updates [software/debmonitor] - 10https://gerrit.wikimedia.org/r/442876 (https://phabricator.wikimedia.org/T191299) [15:23:42] Chasemp sorry I connected them. I thought the new shelves were powered off [15:23:51] ah there is the mystery [15:23:54] ohhhh [15:23:55] ok [15:23:56] (03CR) 10jerkins-bot: [V: 04-1] Improve validation on host package updates [software/debmonitor] - 10https://gerrit.wikimedia.org/r/442876 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [15:24:27] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:24:53] I have to hop into a meeting apergos and cmjohnson1, thanks (we have a bit maint in 40 minutes) [15:25:01] so do I [15:31:47] (03CR) 10Alexandros Kosiaris: "Not bad at all as a first step. Pretty good in fact!" 
[puppet] - 10https://gerrit.wikimedia.org/r/442301 (https://phabricator.wikimedia.org/T178690) (owner: 10Filippo Giunchedi) [15:31:59] apergos I disconnected the new disk shelves [15:32:07] per chasemp request [15:32:07] PROBLEM - HP RAID on labstore1006 is CRITICAL: CRITICAL: Slot 1: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK --- Slot 3: Failed: 1E:1:2, 1E:1:4, 1E:1:6, 1E:1:8, 1E:1:10, 1E:1:12 - OK: 1E:1:1, 1E:1:3, 1E:1:5, 1E:1:7, 1E:1:9, 1E:1:11 - Controller: OK - Battery/Capacitor: OK [15:32:10] ACKNOWLEDGEMENT - HP RAID on labstore1006 is CRITICAL: CRITICAL: Slot 1: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK --- Slot 3: Failed: 1E:1:2, 1E:1:4, 1E:1:6, 1E:1:8, 1E:1:10, 1E:1:12 - OK: 1E:1:1, 1E:1:3, 1E:1:5, 1E:1:7, 1E:1:9, 1E:1:11 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: htt [15:32:10] kimedia.org/T198408 [15:32:14] 10Operations, 10ops-eqiad: Degraded RAID on labstore1006 - https://phabricator.wikimedia.org/T198408 (10ops-monitoring-bot) [15:32:29] I see [15:32:31] /dev/mapper/data-dumps on /srv/dumps type ext4 (ro,noatime,stripe=384,data=ordered) [15:32:34] still on labstore1007 [15:32:45] I can't really look at it right now, meeting [15:32:53] cmjohnson1: ^ I think chris is shuting down the new shelves for now apergos [15:41:33] can someone fsck or remount or whatever needs to happen over there please? [15:41:37] they are still ro [15:41:50] I'm only on labstore1007, I have no idea about the other hosts [15:43:24] 10Operations, 10Wikidata, 10monitoring, 10Patch-For-Review, 10User-Addshore: Add Addshore & possibly other WMDE devs/deployers to the wikidata icinga contact list - https://phabricator.wikimedia.org/T195289 (10Ladsgroup) Is this done? 
[15:45:07] RECOVERY - Host ms-be1036 is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [15:45:33] godog ^ it's back [15:47:00] 10Operations, 10ops-eqiad: ms-be1036 in power off status, not responsive to power on commands - https://phabricator.wikimedia.org/T196873 (10Cmjohnson) I pushed the schedule and the HP tech came today. The server is back online. @godog please resolve if satisfied. [15:49:17] PROBLEM - puppet last run on labstore1006 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/srv/dumps/xmldatadumps/public/other/unique_devices/readme.html],File[/srv/dumps/xmldatadumps/public/other/misc] [15:50:58] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational [15:52:51] apergos: we're about to do some network maintenance on other labstores and I kind of want to ignore the 1006/1007 issues until after our window. Can you live with that? (It shouldn't be long) [15:53:09] it means rsyncs will fail for awhile [15:53:20] when is your window? [15:53:53] in 7 minutes [15:53:55] fine [15:54:12] I was steeling myelf for "oh in 6 hours' [15:54:18] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:54:34] apergos: thanks [15:54:59] thanks for letting me know/looking at it later [15:56:48] RECOVERY - Memory correctable errors -EDAC- on cp1053 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=cp1053&var-datasource=eqiad%2520prometheus%252Fops [16:00:04] godog, moritzm, and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180628T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:00:20] (03PS2) 10Vgutierrez: vcl: Bump AES128-SHA pageview replacement to 10% [puppet] - 10https://gerrit.wikimedia.org/r/441804 (https://phabricator.wikimedia.org/T192555) [16:00:34] (03CR) 10Vgutierrez: [C: 032] vcl: Bump AES128-SHA pageview replacement to 10% [puppet] - 10https://gerrit.wikimedia.org/r/441804 (https://phabricator.wikimedia.org/T192555) (owner: 10Vgutierrez) [16:00:37] PROBLEM - Device not healthy -SMART- on labstore1007 is CRITICAL: cluster=misc device={cciss,14,cciss,15,cciss,16,cciss,17,cciss,18,cciss,19,cciss,20,cciss,21,cciss,22,cciss,23} instance=labstore1007:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1007&var-datasource=eqiad%2520prometheus%252Fops [16:03:29] apergos: labstore1007 dmesg [16:03:32] https://www.irccloud.com/pastebin/Spm5ruwv/ [16:03:46] I'm n a meeting :-( [16:05:32] (03PS3) 10Paladox: Gerrit: Clone avatars repo into /var/www/avatars [puppet] - 10https://gerrit.wikimedia.org/r/440104 [16:06:09] (03PS4) 10Paladox: Gerrit: Clone avatars repo into /var/www/avatars [puppet] - 10https://gerrit.wikimedia.org/r/440104 [16:09:15] !log restarting Cassandra instances on restbase2005 to pick up Java security update [16:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:19] 10Operations, 10Operations-Software-Development, 10Goal, 10Patch-For-Review: Release and deploy Debmonitor (patch management software) [Technology Goal 2017-18_Q4] - https://phabricator.wikimedia.org/T191298 (10Volans) [16:09:23] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Debmonitor: deploy the service in production - https://phabricator.wikimedia.org/T191299 (10Volans) 05Open>03Resolved The service is in production and working fine. Some fine-tune will follow in separated tasks. Goal wise this is comple... 
[16:09:53] 10Operations, 10Operations-Software-Development, 10Goal, 10Patch-For-Review: Release and deploy Debmonitor (patch management software) [Technology Goal 2017-18_Q4] - https://phabricator.wikimedia.org/T191298 (10Volans) [16:09:58] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Debmonitor: deploy the agent across the fleet - https://phabricator.wikimedia.org/T191300 (10Volans) 05Open>03Resolved The client is in production across the whole fleet and working fine. Some fine-tune might follow in separated tasks.... [16:10:17] 10Operations, 10Operations-Software-Development, 10Goal, 10Patch-For-Review: Release and deploy Debmonitor (patch management software) [Technology Goal 2017-18_Q4] - https://phabricator.wikimedia.org/T191298 (10Volans) [16:10:44] 10Operations, 10Operations-Software-Development, 10Goal, 10Patch-For-Review: Release and deploy Debmonitor (patch management software) [Technology Goal 2017-18_Q4] - https://phabricator.wikimedia.org/T191298 (10Volans) 05Open>03Resolved The service and client are in production and working fine. Some fi... [16:10:57] (03PS1) 10Muehlenhoff: Extend access for jsamra [puppet] - 10https://gerrit.wikimedia.org/r/442881 [16:12:07] RECOVERY - Device not healthy -SMART- on labstore1006 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1006&var-datasource=eqiad%2520prometheus%252Fops [16:12:50] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: New tool to track package updates/status for hosts and images (debmonitor) - https://phabricator.wikimedia.org/T167504 (10Volans) The service and client are in production and working fine. Leaving the task open for the Docker images part. 
[16:13:42] (03CR) 10Muehlenhoff: [C: 032] Extend access for jsamra [puppet] - 10https://gerrit.wikimedia.org/r/442881 (owner: 10Muehlenhoff) [16:17:47] PROBLEM - Host labstore1004 is DOWN: PING CRITICAL - Packet loss = 100% [16:18:17] PROBLEM - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [16:18:55] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 356 bytes in 60.008 second response time [16:19:07] RECOVERY - Host labstore1004 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [16:19:18] ok well those are legit except it sould be returning [16:19:28] apologies this maintenance has been a bit of chaos [16:20:27] ok I am now here (meeting out) [16:20:35] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.012 second response time [16:21:53] PROBLEM - drbd service on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit drbd is inactive [16:23:47] ^ just got the page on that. am around if you need any help [16:24:12] RECOVERY - drbd service on labstore1004 is OK: OK - drbd is active [16:27:24] 10Operations, 10ops-eqiad: Degraded RAID on labstore1006 - https://phabricator.wikimedia.org/T198408 (10herron) p:05Triage>03High [16:29:47] PROBLEM - Host es1015 is DOWN: PING CRITICAL - Packet loss = 100% [16:30:01] jynus: marostegui ^ ? [16:30:32] Checking [16:30:40] I think it was going to be reimaged [16:30:43] Maybe downtime expired? [16:30:45] checking anyways [16:30:47] RECOVERY - Device not healthy -SMART- on labstore1007 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1007&var-datasource=eqiad%2520prometheus%252Fops [16:30:49] not 15 [16:30:52] ah, not 15 [16:30:53] ok [16:30:55] so depooling it [16:30:57] it could mean a site-wide outage [16:31:56] it was probably depooled automatically by mediawiki by the lb per that task by jaime is not great [16:32:06] could indeed cause issues [16:32:10] akosiaris: load balancer doesn't work [16:32:20] yeah I 've read that task [16:32:20] and less with network or hw issues [16:32:20] (03PS1) 10Marostegui: db-eqiad.php: Depool es1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442888 [16:32:32] I 'll gonna try the mgmt [16:32:33] jynus: ^ [16:32:58] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Depool es1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442888 (owner: 10Marostegui) [16:33:09] (03CR) 10Marostegui: [V: 032 C: 032] db-eqiad.php: Depool es1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442888 (owner: 10Marostegui) [16:33:19] 10Operations, 10ops-eqiad: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 (10Cmjohnson) [16:33:28] (03CR) 10jenkins-bot: db-eqiad.php: Depool es1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442888 (owner: 10Marostegui) [16:33:30] eth0: mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000 [16:33:31] Deploying [16:33:38] something networky is going on [16:34:19] cmjohnson1: any chance something happened to C2 ? 
[16:34:24] XioNoX: ^ [16:34:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool es1015 - crashed (duration: 00m 57s) [16:34:26] mediawiki connection error https://logstash.wikimedia.org/goto/5b54c3ce596239a5908c43866b151449 [16:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:37] es1015 (U15) got disconnected from the network [16:34:42] only 3000 [16:34:57] RECOVERY - Host es1015 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:34:57] so load balancer could have worked this time, too early to say [16:35:06] akosiaris: so the server is actually up then? [16:35:13] at least those are good news (so no mysql crash) [16:35:13] yup [16:35:14] looking [16:35:17] working just fine [16:35:28] Jun 28 16:27:39 es1015 kernel: [9013291.474181] tg3 0000:01:00.0 eth0: Link is down [16:35:28] Jun 28 16:34:43 es1015 kernel: [9013715.430079] tg3 0000:01:00.0 eth0: Link is up at 1000 Mbps, full duplex [16:35:35] hmm [16:35:35] uh [16:35:44] i think we have a loose connection [16:37:24] It is up again [16:38:07] yeah a ping -f does not spot any missed packets [16:38:30] cmjohnson1: what was it then? [16:38:38] cable misbehaving? [16:38:55] ironically I was just on that switch moving labstore1004 [16:39:13] probably a loose rj14 jack ? [16:39:45] maybe a lose rj11? [16:39:58] haha [16:40:35] Maybe a loose coax? [16:40:40] :D [16:40:58] I will leave it depooled a bit more to make sure it is all fine [16:41:03] +1 [16:41:07] PROBLEM - puppet last run on es1015 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:41:10] ah it is probably [16:41:22] it would be nice to check how mediawiki behaved [16:41:22] I can tell you I am doing ping -f for a long time now on it [16:41:30] and it's nice to see no packets lost [16:41:37] but for es* servers maybe the logic is cleaner/simpler [16:41:49] plus they have way less connections [16:42:11] (the problem is not immediate when it happens, it takes some time to build up) [16:42:25] maybe we were just fast enough [16:43:35] Could be yeah [16:43:56] it is not a 100% sure problem when it happens [16:44:18] e.g. I think it happens with DROP but not REJECT due to network timeouts [16:44:58] (03CR) 10Bstorm: [C: 032] WIP labstore: switch labstore1005 to primary in pair [puppet] - 10https://gerrit.wikimedia.org/r/442870 (https://phabricator.wikimedia.org/T187962) (owner: 10Rush) [16:45:16] (03PS2) 10Bstorm: WIP labstore: switch labstore1005 to primary in pair [puppet] - 10https://gerrit.wikimedia.org/r/442870 (https://phabricator.wikimedia.org/T187962) (owner: 10Rush) [16:45:39] <_joe_> yes exactly that jynus [16:45:56] it is not only that [16:46:13] <_joe_> well in this case it's DROP-like behaviour [16:46:21] it requires some cache expiring and other interactions [16:46:21] <_joe_> so the problem *should* be there [16:46:29] <_joe_> heh ok [16:46:46] maybe it cannot happen on es* servers because there is not gtid wait, for example [16:48:12] moritzm: Ok to merge the commit for access for jsamra? 
[16:48:44] <_joe_> bstorm_: assume it is [16:48:49] ok :) [16:49:13] <_joe_> if it was something less simple I would've advised to wait [16:49:30] <_joe_> for moritzm to respond, but in this case it's safe to assume it's ok [16:49:33] Fair enough [16:49:37] That makes sense [16:49:39] <_joe_> and I get you're in the middle of a migration [16:50:36] 👍🏻 [16:51:17] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational [16:54:28] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:54:54] ^ arturo :) [16:55:11] how :S [16:55:30] according to icinga, it's downtimed [16:56:21] perhaps because the alert was before the downtime [16:57:17] RECOVERY - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1015 bytes in 0.065 second response time [16:57:33] I just disabled notifications for all the services on the host and the host itself [16:57:55] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool es1015" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442891 [16:58:01] jynus: ^ going to repool [16:58:25] should we wait until tomorrow? [16:58:39] That wouldn't hurt [16:58:44] sorry, I don't know if the reason was caught [16:58:46] Let's do it [16:58:49] like a mistake or something [16:58:53] if not, it won't hurt [16:59:04] the other server depooled is on a different shard [16:59:29] Yeah, let's leave it till tomorrow [16:59:51] 10Operations, 10Discovery, 10Discovery-Search: migrate elasticsearch cirrus cluster to RAID0 - https://phabricator.wikimedia.org/T198391 (10herron) p:05Triage>03Normal What are your thoughts about RAID10, RAID5(0) or even exposing each individual disk to ES an option for expansion? I am leery of RAID0 s... [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: Dear deployers, time to do the Services – Graphoid / Parsoid / Citoid / ORES deploy. 
Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180628T1700). [17:01:47] ah, sorry, forgot to press ENTER in puppet-merge... [17:03:00] moritzm: :-) [17:04:28] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.235 second response time [17:04:38] ^ andrewbogott :) [17:05:17] does that mean I made it worse? [17:06:17] RECOVERY - puppet last run on es1015 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:06:40] should I downtime toolschecker? [17:07:22] andrewbogott: I thought that was recovery...and it's not my bad [17:07:25] arturo: sure please [17:07:32] we are making too much noise I think unnecessarily [17:07:41] but andrewbogott I'm not sure what is still broken will look [17:09:00] downtimed [17:09:27] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.234 second response time [17:09:51] real recovery, apologies for the chaos. things went a little sideways^ [17:16:03] 10Operations, 10ops-eqiad: mw1239 correctable memory errors - https://phabricator.wikimedia.org/T198398 (10herron) p:05Triage>03High Is a DIMM swap on channel:1 slot:0 the action to take on this? [17:20:56] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational [17:24:16] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:31:13] 10Operations, 10Discovery, 10Discovery-Search: migrate elasticsearch cirrus cluster to RAID0 - https://phabricator.wikimedia.org/T198391 (10EBernhardson) for the elasticsearch cluster, we could probably lose 3 or 4 machines before there was any thought of potential urgency. Elasticsearch can handle being pro... [17:35:56] 10Operations, 10Discovery, 10Discovery-Search: migrate elasticsearch cirrus cluster to RAID0 - https://phabricator.wikimedia.org/T198391 (10Gehel) As an example, during clsuter restarts, my standard procedure is to restart 3 nodes at a time. So we have strong evidence that loosing 3 nodes is a non issue. [17:44:28] PROBLEM - Host labstore1007 is DOWN: CRITICAL - Host Unreachable (208.80.155.106) [17:48:34] (03PS5) 10Giuseppe Lavagetto: [WIP] Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) [17:48:36] (03PS1) 10Giuseppe Lavagetto: Sanitize class names for entities [software/conftool] - 10https://gerrit.wikimedia.org/r/442899 [17:49:18] RECOVERY - Host labstore1007 is UP: PING OK - Packet loss = 0%, RTA = 0.14 ms [17:49:56] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [17:49:58] (03CR) 10jerkins-bot: [V: 04-1] Sanitize class names for entities [software/conftool] - 10https://gerrit.wikimedia.org/r/442899 (owner: 10Giuseppe Lavagetto) [18:06:08] PROBLEM - Device not healthy -SMART- on labvirt1009 is CRITICAL: cluster=labvirt device=cciss,8 instance=labvirt1009:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labvirt1009&var-datasource=eqiad%2520prometheus%252Fops [18:14:10] (03PS1) 10Krinkle: webperf: Get graphite_host for coal::processor from Hiera [puppet] - 
10https://gerrit.wikimedia.org/r/442900 (https://phabricator.wikimedia.org/T195314) [18:20:49] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational [18:23:18] (03PS1) 10Volans: manage.py: add custom command for GC [software/debmonitor] - 10https://gerrit.wikimedia.org/r/442901 [18:24:18] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:24:25] (03CR) 10jerkins-bot: [V: 04-1] manage.py: add custom command for GC [software/debmonitor] - 10https://gerrit.wikimedia.org/r/442901 (owner: 10Volans) [18:26:54] 10Operations, 10ops-eqiad, 10Cloud-VPS: rack upgraded storage capacity in labstore100[67].eqiad.wmnet - https://phabricator.wikimedia.org/T196651 (10chasemp) We ran into trouble here: * RAID issues reported and errors, and the /srv/dumps path was changed to ro * Chris set shelves back to before * labstore10... [18:33:12] (03CR) 10Krinkle: "Compiler failed (as expected) given I didn't add the Hiera field for this role yet, just wanted to confirm that in case it was being set i" [puppet] - 10https://gerrit.wikimedia.org/r/442900 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [18:36:09] RECOVERY - Device not healthy -SMART- on labvirt1009 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labvirt1009&var-datasource=eqiad%2520prometheus%252Fops [18:37:51] (03PS2) 10Krinkle: webperf: Get graphite_host for coal::processor from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/442900 (https://phabricator.wikimedia.org/T195314) [18:38:13] (03PS1) 10QChris: Add .gitreview [software/certcentral] - 10https://gerrit.wikimedia.org/r/442904 [18:38:16] (03CR) 10QChris: [V: 032 C: 032] Add .gitreview [software/certcentral] - 10https://gerrit.wikimedia.org/r/442904 (owner: 10QChris) [18:43:33] (03CR) 10Krinkle: "No on-disk difference for prod:" [puppet] - 10https://gerrit.wikimedia.org/r/442900 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [18:44:02] (03CR) 10Krinkle: "Diff from beta/webperf11:" [puppet] - 10https://gerrit.wikimedia.org/r/442900 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [18:48:09] RECOVERY - puppet last run on labstore1007 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [18:52:49] tgr: rolling the train soon. it would be great to have you around to help in case [18:53:38] i'll be watching the logs like a hawk but more eyeballs are always helpful :) [18:53:53] oh and thcipriani is helping too [18:54:25] * thcipriani raring [18:54:41] again, the plan is: group1 - commons, commons, vet vet vet, group2 [18:57:54] marxarelli: when are you starting? [18:58:23] tgr: in 2 minutes, but i can wait a bit if that means you'll be more ready [18:59:22] FWIW I'm pretty sure the patch fixes the issue we have seen. I'm not sure at all there are no other issues - the MCR patches together were 3000 lines or so. [18:59:48] I don't think we have any better way of finding out than deploying though :( [19:00:00] I can be around for an hour, maybe two [19:00:04] marxarelli: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180628T1900). [19:01:17] (03PS1) 10Dduvall: Group1 (less commons) to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442909 [19:01:34] 10Operations, 10Mail, 10monitoring, 10User-herron, 10Wikimedia-Incident: Improve outbound mail service alerting - https://phabricator.wikimedia.org/T197172 (10herron) p:05High>03Normal [19:01:43] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review, 10User-herron: Icinga check for sysctl settings - https://phabricator.wikimedia.org/T160060 (10herron) p:05High>03Normal [19:02:27] tgr: right on. thanks [19:02:35] hrm did it not move the symlink? [19:03:03] thcipriani: it's already pointing to wmf.10 apparently [19:03:20] marxarelli: looks modified on deploy1001 [19:03:26] checkout git status [19:03:38] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 302 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:03:39] oh whoops. thanks for catching that! 
[19:03:40] :) [19:03:44] :) [19:04:04] (03PS2) 10Dduvall: Group1 (less commons) to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442909 [19:04:05] 10Operations, 10Mail, 10monitoring, 10User-herron, 10Wikimedia-Incident: Improve outbound mail service alerting - https://phabricator.wikimedia.org/T197172 (10herron) p:05Normal>03High [19:04:34] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review, 10User-herron: Icinga check for sysctl settings - https://phabricator.wikimedia.org/T160060 (10herron) p:05Normal>03High [19:04:58] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 47 probes of 323 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [19:05:07] (03CR) 10Thcipriani: [C: 031] Group1 (less commons) to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442909 (owner: 10Dduvall) [19:05:11] lgtm [19:05:31] (03CR) 10Dduvall: [C: 032] Group1 (less commons) to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442909 (owner: 10Dduvall) [19:07:04] (03Merged) 10jenkins-bot: Group1 (less commons) to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442909 (owner: 10Dduvall) [19:08:48] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 8 probes of 302 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:09:51] (03CR) 10Imarlier: [C: 031] webperf: Get graphite_host for coal::processor from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/442900 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [19:09:59] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 0 probes of 323 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [19:16:06] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: Group1 (less commons) to 1.32.0-wmf.10 [19:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:48] (03CR) 10BryanDavis: "This role 
may be unused now outside of my test environment in the striker Cloud VPS project. The production striker deploys are now using " [puppet] - 10https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) (owner: 10Filippo Giunchedi) [19:17:21] !log dduvall@deploy1001 Synchronized php: Group1 (less commons) to 1.32.0-wmf.10 (duration: 00m 57s) [19:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:54] tgr, thcipriani: ^ (nothing terrible so far) [19:19:20] so far so good afaict [19:21:11] i'll give it another 5 minutes or so, and then roll to commonswiki [19:21:22] but logs look really clean [19:22:05] almost...suspiciously clean [19:22:07] :) [19:22:15] * thcipriani adds drama [19:22:16] knock on wood you jerk [19:22:57] * bd808 tosses salt over shoulder and spits 3 times to help greg-g out [19:25:02] (03PS1) 10Rush: toolforge: remove labstore1006 from dumps config [puppet] - 10https://gerrit.wikimedia.org/r/442913 [19:27:18] thcipriani: oh good, there's at least 1 "exceeded memory limit" error now, for wmf.10 :) [19:27:36] :) [19:28:07] alright. 
rolling out to commonswiki [19:28:19] +1 [19:29:03] (03CR) 10Rush: [C: 032] toolforge: remove labstore1006 from dumps config [puppet] - 10https://gerrit.wikimedia.org/r/442913 (owner: 10Rush) [19:29:54] (03PS1) 10Dduvall: commonswiki to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442914 [19:31:26] (03CR) 10Dduvall: [C: 032] commonswiki to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442914 (owner: 10Dduvall) [19:32:41] (03Merged) 10jenkins-bot: commonswiki to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442914 (owner: 10Dduvall) [19:33:00] 10Operations, 10netops: Allow labnet/labnodepool/labvirt to connect to debmonitor hosts/443 - https://phabricator.wikimedia.org/T198375 (10ayounsi) 05Open>03Resolved a:03ayounsi Policy added: ```lang=diff [edit firewall family inet filter labs-in4] + term debmonitor { + from { +... [19:34:23] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: commonswiki to 1.32.0-wmf.10 [19:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:57] good so far... [19:37:10] yep [19:41:49] * James_F crosses fingers and toes. [19:45:17] * marxarelli is seeing something [19:45:37] lock wait timeouts again [19:45:48] tgr, thcipriani: ^ [19:46:08] from commons [19:46:11] rolling back [19:47:09] +1 [19:47:18] (03PS1) 10Dduvall: Rollback commonswiki to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442915 [19:47:29] uh, that query is still the same wrong one [19:47:40] could the patch have gotten lost somehow? 
[19:48:03] i verified it was there in git log [19:48:04] sec [19:48:29] let me test it on wmf.10 [19:49:11] (03CR) 10jenkins-bot: Group1 (less commons) to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442909 (owner: 10Dduvall) [19:49:13] (03CR) 10jenkins-bot: commonswiki to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442914 (owner: 10Dduvall) [19:51:11] tgr: i screwed up. didn't actually get the patch synced :( [19:51:19] PROBLEM - Host labstore1006 is DOWN: PING CRITICAL - Packet loss = 100% [19:51:24] rolling back, then syncing, then forward [19:51:27] sheesh [19:51:50] did wmf10 go live on commons afterall? [19:52:34] DanielK_WMDE: didn't go live initially, all looked calm, then commons rolled forward, rolling back commons now [19:52:52] ic [19:53:22] and the issue was that tgr's fix wasn't deployed? sorry, i joined late [19:53:39] yes [19:54:05] marxarelli: ping me if it's synced, I'll do some testing on mwdebug1001/group0 [19:54:41] tgr: will do [19:55:08] RECOVERY - Host labstore1006 is UP: PING WARNING - Packet loss = 86%, RTA = 0.15 ms [19:55:39] RECOVERY - HP RAID on labstore1006 is OK: OK: Slot 1: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK --- Slot 3: OK: 1E:1:1, 1E:1:2, 1E:1:3, 1E:1:4, 1E:1:5, 1E:1:6, 1E:1:7, 1E:1:8, 1E:1:9, 1E:1:10, 1E:1:11, 1E:1:12 - Controller: OK - Battery/Capacitor: OK [19:56:12] waiting on sync-wikiversions to re-sync. it seems stalled [19:57:07] (03CR) 10C. 
Scott Ananian: Replace Tidy with RemexHtml everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442142 (https://phabricator.wikimedia.org/T175706) (owner: 10Subramanya Sastry) [19:57:28] PROBLEM - NFS on labstore1006 is CRITICAL: connect to address 208.80.154.7 and port 2049: Connection refused [19:57:29] (03PS1) 10Daniel Kinzler: MCR DNM Enable MCR write-both mode on commons beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442918 (https://phabricator.wikimedia.org/T197818) [19:57:55] what's going for Commons? https://phabricator.wikimedia.org/T198350 [19:58:06] same error messages as yesterday [19:58:58] RECOVERY - puppet last run on labstore1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:59:03] yannf: yeah, known [19:59:51] tgr: but no errors showed up on wikidata? that's surprising [20:01:16] DanielK_WMDE: yeah, not sure what to make of that [20:01:25] 10Operations, 10ops-eqiad, 10Cloud-VPS: rack upgraded storage capacity in labstore100[67].eqiad.wmnet - https://phabricator.wikimedia.org/T196651 (10chasemp) labstore1007 has been restored to service and NFS clients and web users are pointed at it (https://gerrit.wikimedia.org/r/c/operations/puppet/+/442913)... [20:02:19] tgr: maybe not enough things were being deleted... [20:02:55] why is it different from last time though? some kind of time-of-day pattern? [20:02:58] !log dduvall@deploy1001 sync-wikiversions aborted: Rollback commonswiki to 1.32.0-wmf.8 (duration: 15m 24s) [20:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:13] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: Rollback commonswiki to 1.32.0-wmf.8 (resync following ssh hang) [20:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:30] are there errors again? [20:06:15] tgr: possibly... 
[20:06:52] addshore: tgr found the issue, but the re-deploy accidentally went out without the fix... [20:06:57] tgr: syncing the fix now [20:07:04] aaaaaah, not great ;) [20:07:09] !log dduvall@deploy1001 Synchronized php-1.32.0-wmf.10/includes/page/WikiPage.php: Syncing table locking fix (T198350) (duration: 00m 57s) [20:07:10] * addshore goes back to eating [20:07:11] ^ nope :( [20:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:12] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment - https://phabricator.wikimedia.org/T198350 [20:07:14] not great [20:07:23] addshore: remember this? https://gerrit.wikimedia.org/r/c/mediawiki/core/+/442889/1/includes/page/WikiPage.php [20:07:27] we got it wrong :P [20:07:49] sorry, I should have tested on group0 in the first place [20:07:50] DanielK_WMDE: yes, that's actually what I was looking at this morning, but got distracted by other tasks after lunch [20:07:59] 10Operations, 10ops-eqiad, 10Cloud-VPS: rack upgraded storage capacity in labstore100[67].eqiad.wmnet - https://phabricator.wikimedia.org/T196651 (10Bstorm) Cabling information grabbed from these two documents: D3600 manual: http://h20628.www2.hp.com/km-ext/kmcsdirect/emr_na-c04219600-1.pdf D3000 series wiri... [20:08:39] addshore: i stared at it too, but didn't see the issue. found it hard to believe that deletions could be the problem, seemed too low volume. 
[20:08:44] (03CR) 10Dduvall: [C: 032] Rollback commonswiki to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442915 (owner: 10Dduvall) [20:09:11] ^ fyi, synced before pushing for review [20:09:59] (03Merged) 10jenkins-bot: Rollback commonswiki to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442915 (owner: 10Dduvall) [20:10:15] (03CR) 10jenkins-bot: Rollback commonswiki to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442915 (owner: 10Dduvall) [20:10:22] tgr: let me know when you've tested the fix [20:11:24] I'll mess up some files on mwdebug1001 [20:15:10] well i really mucked that up. strange that the error didn't surface on wikidata though [20:19:30] marxarelli: tested, works [20:19:40] did I mention that PsySH is awesome? [20:19:45] tgr: excellent! [20:21:49] let's try this again, the right way [20:24:10] (03PS1) 10Dduvall: commonswiki to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442976 [20:24:50] thcipriani: ^ [20:26:09] (03CR) 10Thcipriani: [C: 031] commonswiki to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442976 (owner: 10Dduvall) [20:26:12] (03CR) 10Dduvall: [C: 032] commonswiki to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442976 (owner: 10Dduvall) [20:26:13] well I +1'd but wikibugs is...there it is [20:27:05] (03Merged) 10jenkins-bot: commonswiki to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442976 (owner: 10Dduvall) [20:29:05] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: commonswiki to 1.32.0-wmf.10 otra vez [20:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:05] looking good this time [20:46:30] !log Rolling 1.32.0-wmf.10 to group2 following fix and successful re-deploy to group1 [20:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:03] !log dduvall@deploy1001 rebuilt and synchronized 
wikiversions files: all wikis to 1.32.0-wmf.10 [20:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:36] (03PS1) 10Dduvall: all wikis to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442981 [20:54:38] (03CR) 10Dduvall: [C: 032] all wikis to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442981 (owner: 10Dduvall) [20:54:48] (03Merged) 10jenkins-bot: all wikis to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442981 (owner: 10Dduvall) [21:00:05] Niharika and mooeypoo: Dear deployers, time to do the PageTriage deploy deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180628T2100). [21:00:47] marxarelli: how's things looking? [21:01:23] greg-g: things look ok [21:02:54] give it another 10 minutes and we'll call it done? [21:03:09] sounds good [21:06:37] 18:30:47 RoanKattouw: Be sure to file a task if there isn't one already. [21:07:02] Filed T198422 and T198423 [21:07:03] T198423: Linting phase in scap doesn't surface errors - https://phabricator.wikimedia.org/T198423 [21:07:03] T198422: Running scap sync-dir php-1.32.0-wmf.10 fails due to syntax error - https://phabricator.wikimedia.org/T198422 [21:07:22] (03CR) 10jenkins-bot: commonswiki to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442976 (owner: 10Dduvall) [21:07:24] (03CR) 10jenkins-bot: all wikis to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442981 (owner: 10Dduvall) [21:12:54] marxarelli: still good? [21:13:17] greg-g: still good [21:15:18] mooeypoo: all yours [21:15:25] greg-g: Thanks. [21:15:40] :) [21:20:48] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational [21:24:08] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[21:34:05] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar): Deploy mcrouter to production as a wancache backend - https://phabricator.wikimedia.org/T192370 (10Krinkle) [21:34:26] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar): Deploy mcrouter to production as a wancache backend - https://phabricator.wikimedia.org/T192370 (10Krinkle) [21:34:35] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar): Deploy mcrouter to production as a wancache backend - https://phabricator.wikimedia.org/T192370 (10Krinkle) [21:36:46] moritzm: apergos: available for two quick webperf puppet patches? [21:36:56] I am so not here. [21:37:03] it is midnight 30 after a very long day [21:37:29] No problem - don't stay up for this, it can wait. [21:38:54] I could stay up. I just don't have any working brain cells left [21:40:05] same here, add me to reviewers and I'll have a look tomorrow [21:42:05] Thx, done. 
[21:43:42] (03PS4) 10Niharika29: Enable Draft namespace and AfC mode for PageTriage on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441944 (https://phabricator.wikimedia.org/T198143) (owner: 10MusikAnimal) [21:43:58] (03CR) 10Niharika29: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441944 (https://phabricator.wikimedia.org/T198143) (owner: 10MusikAnimal) [21:45:27] (03Merged) 10jenkins-bot: Enable Draft namespace and AfC mode for PageTriage on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441944 (https://phabricator.wikimedia.org/T198143) (owner: 10MusikAnimal) [21:52:52] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable Draft namespace and AfC mode for PageTriage on testwiki T198143 (duration: 00m 53s) [21:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:56] T198143: Enable Draft namespace on testwiki - https://phabricator.wikimedia.org/T198143 [21:55:18] !log niharika29@deploy1001 Synchronized php-1.32.0-wmf.10/extensions/PageTriage/: Update extension directory (duration: 00m 51s) [21:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:51] (03PS2) 10Andrew Bogott: labtestn: use proper labtestn db password from hiera [puppet] - 10https://gerrit.wikimedia.org/r/440366 [22:05:50] (03CR) 10Andrew Bogott: [C: 032] labtestn: use proper labtestn db password from hiera [puppet] - 10https://gerrit.wikimedia.org/r/440366 (owner: 10Andrew Bogott) [22:13:52] (03PS8) 10Krinkle: profiler-labs: Use FlameGraph-compatible format for xhprof sampler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434522 (https://phabricator.wikimedia.org/T176916) [22:15:55] (03PS9) 10Krinkle: profiler-labs: Remove 'sampleprofiler' experiment. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/434522 (https://phabricator.wikimedia.org/T176916) [22:16:00] (03PS10) 10Krinkle: profiler-labs: Remove 'sampleprofiler' experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434522 (https://phabricator.wikimedia.org/T176916) [22:24:35] (03CR) 10jenkins-bot: Enable Draft namespace and AfC mode for PageTriage on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441944 (https://phabricator.wikimedia.org/T198143) (owner: 10MusikAnimal) [22:26:44] (03CR) 10Krinkle: [C: 032] "beta-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434522 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [22:27:59] (03Merged) 10jenkins-bot: profiler-labs: Remove 'sampleprofiler' experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434522 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [22:28:59] PROBLEM - tilerator on maps-test2003 is CRITICAL: connect to address 10.192.16.34 and port 6534: Connection refused [22:29:19] PROBLEM - Check systemd state on maps-test2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[22:29:59] PROBLEM - tileratorui on maps-test2003 is CRITICAL: connect to address 10.192.16.34 and port 6535: Connection refused [22:39:24] (03CR) 10jenkins-bot: profiler-labs: Remove 'sampleprofiler' experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434522 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [22:44:14] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Datasets-General-or-Unknown: rack upgraded storage capacity in labstore100[67].eqiad.wmnet - https://phabricator.wikimedia.org/T196651 (10Nemo_bis) [22:49:19] RECOVERY - Check systemd state on maps-test2003 is OK: OK - running: The system is fully operational [22:49:58] RECOVERY - tileratorui on maps-test2003 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.095 second response time [22:50:08] RECOVERY - tilerator on maps-test2003 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.099 second response time [22:51:48] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational [22:54:59] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:56:52] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install graphite2003 - https://phabricator.wikimedia.org/T196483 (10Papaul) [22:56:55] 10Operations, 10ops-codfw, 10netops: switch port configuration for graphite2003 - https://phabricator.wikimedia.org/T198119 (10Papaul) 05Open>03Resolved a:03Papaul switch configuration done Interface Admin Link Description ge-5/0/17 up down graphite2003 [22:59:02] (03PS2) 10Krinkle: Use perftools/xhgui-collector instead of perftools/xhgui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432016 [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180628T2300). [23:00:04] bmansurov and RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:09] here [23:00:25] Here [23:00:56] I can do the deploy myself if needed, but in ~5 mins [23:01:07] seems easy to get a sticker, then [23:05:29] RoanKattouw: If possible, mooeypoo/Niharika would like you to do a full scap at the end for i18n sync of a previous deploy. [23:05:57] OK [23:06:01] James_F: RoanKattouw: It's not very urgent and can wait. [23:06:05] Until next week. [23:06:17] Sure, but let's not leave prod broken if we don't have to. [23:06:33] I don't consider testwiki as prod. :P [23:06:51] However, until we finally delete the stupid thing, it is. [23:07:26] (03CR) 10Catrope: [C: 032] Increase Schema:CitationUsage sampling rate to 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441567 (https://phabricator.wikimedia.org/T191086) (owner: 10Bmansurov) [23:08:16] OK, well the first step is waiting for Jenkins [23:08:42] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install authdns2001.wikimedia.org - https://phabricator.wikimedia.org/T196664 (10Papaul) [23:08:44] 10Operations, 10ops-codfw, 10netops: Swith port information for authdns2001 - https://phabricator.wikimedia.org/T198126 (10Papaul) 05Open>03Resolved a:03Papaul switch port configuration done Interface Admin Link Description ge-5/0/5 up down authdns2001 [edit interfaces interface-rang... [23:09:11] bmansurov: Uhh are you sure you put the right change on the Deployments page? 
It depends on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/440867/3 which is not merged [23:09:23] let me see, 1 sec [23:09:28] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install graphite2003 - https://phabricator.wikimedia.org/T196483 (10Papaul) [23:09:29] So right now it's 1:100, the change that's unmerged and not listed for deployment is 1:6.67, and the one you asked to deploy is 1:1 [23:09:41] But the 1:1 change won't merge unless I also +2 the 1:6.67 change [23:09:45] RoanKattouw: let me rebase, it should not depend on that patch [23:10:17] (03PS1) 10Smalyshev: Enable smater logging for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) [23:10:43] (03CR) 10jerkins-bot: [V: 04-1] Enable smater logging for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: 10Smalyshev) [23:11:44] (03PS2) 10Smalyshev: Enable smater logging for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) [23:11:58] (03PS3) 10Bmansurov: Increase Schema:CitationUsage sampling rate to 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441567 (https://phabricator.wikimedia.org/T191086) [23:12:08] RoanKattouw: done [23:12:35] (03CR) 10Catrope: [C: 032] Increase Schema:CitationUsage sampling rate to 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441567 (https://phabricator.wikimedia.org/T191086) (owner: 10Bmansurov) [23:12:38] (03Abandoned) 10Bmansurov: Increase Schema:CitationUsage sampling rate to 15% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440867 (https://phabricator.wikimedia.org/T191086) (owner: 10Bmansurov) [23:13:14] (03CR) 10jerkins-bot: [V: 04-1] Increase Schema:CitationUsage sampling rate to 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441567 (https://phabricator.wikimedia.org/T191086) (owner: 10Bmansurov) [23:13:42] hmm [23:13:51] (03Merged) 
10jenkins-bot: Increase Schema:CitationUsage sampling rate to 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441567 (https://phabricator.wikimedia.org/T191086) (owner: 10Bmansurov) [23:14:08] (03CR) 10jenkins-bot: Increase Schema:CitationUsage sampling rate to 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441567 (https://phabricator.wikimedia.org/T191086) (owner: 10Bmansurov) [23:15:38] bmansurov: On mwdebug1002, please test [23:15:44] RoanKattouw: testing [23:16:55] 10Operations, 10SRE-Access-Requests, 10netops: Get Papaul access to network equipment - https://phabricator.wikimedia.org/T198344 (10ayounsi) 05Open>03Resolved Talked to Papaul on IRC, key push to asw-a/b/c/d-codfw and will be pushed progressively to more devices. I gave him a Juniper configuration and... [23:17:01] RoanKattouw: works [23:18:20] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Increase Schema:CitationUsage sampling rate to 100% (T191086) (duration: 00m 51s) [23:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:25] T191086: Instrument, collect data, and perform the first round of analysis on click-through data on citations/footnotes - https://phabricator.wikimedia.org/T191086 [23:18:49] 10Operations, 10ops-codfw, 10netops: Swith port information for authdns2001 - https://phabricator.wikimedia.org/T198126 (10Papaul) [edit interfaces interface-range vlan-public1-a-codfw] member ge-5/0/23 { ... } + member ge-5/0/5; [edit interfaces interface-range disabled] - member ge-5/... [23:19:31] bmansurov: Deployed [23:19:41] RoanKattouw: thank you! 
[23:20:51] (03PS3) 10Jforrester: Stop loading the MwEmbedSupport extension, part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441518 [23:20:53] (03PS3) 10Jforrester: Stop loading the MwEmbedSupport extension, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441519 [23:20:55] (03PS3) 10Jforrester: Stop loading the MwEmbedSupport extension, part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441520 [23:20:57] (03PS3) 10Jforrester: Stop loading the MwEmbedSupport extension, part IV [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441521 [23:21:48] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational [23:25:09] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:47:15] OK so now Jenkins has finally merged my cherry-picks [23:54:06] !log catrope@deploy1001 Synchronized php-1.32.0-wmf.10/resources/src/mediawiki.rcfilters/: Watchlist perf fixes (T198359, T198399) (duration: 00m 52s) [23:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:11] T198359: Reduce number of times we apply highlights - https://phabricator.wikimedia.org/T198359 [23:54:11] T198399: Avoid unnecessary calls to updateIfHeightChanged on page load when highlighting is in query params - https://phabricator.wikimedia.org/T198399 [23:56:39] is being able to see other people's dashboards intended? [23:56:49] *dashboards on gerrit [23:57:28] Yes. [23:59:57] yes